Week 8 Regression Models

Problem 2. Analysis of Canadian Manufacturing Workers Work Hours As the graph shows the data trending downward and then sharply changing direction back upwards, we wouldn’t use a linear trend model. There is no apparent seasonality it seems from the data but it is hard to tell given the closeness of the years in the graph. There is however a a U-shaped trend similar to what we have seen in the Amtrak data. Therefore, the best solution would be to try a quadratic trend model with seasonality to be able to pick up any possible trends due to season or month.

Problem 3: Toys R US

First we import the data and then graph and examine it from the head to the tail.

library(forecast)
library(ggplot2)
setwd("~/USM_MBA_678")
toys<-read.csv("ToysRUsRevenues.csv", stringsAsFactors = FALSE)
str(toys)

## 'data.frame':    16 obs. of  2 variables:
##  $ QuarterYear          : chr  "Q1-92" "Q2-92" "Q3-92" "Q4-92" ...
##  $ Revenue.in.million...: int  1026 1056 1182 2861 1172 1249 1346 3402 1286 1317 ...

head(toys)

##   QuarterYear Revenue.in.million...
## 1       Q1-92                  1026
## 2       Q2-92                  1056
## 3       Q3-92                  1182
## 4       Q4-92                  2861
## 5       Q1-93                  1172
## 6       Q2-93                  1249

tail(toys)

##    QuarterYear Revenue.in.million...
## 11       Q3-94                  1449
## 12       Q4-94                  3893
## 13       Q1-95                  1462
## 14       Q2-95                  1452
## 15       Q3-95                  1631
## 16       Q4-95                  4200

#Create a time series object out of it
toysTS <- ts(toys$Revenue, start=c(1992, 1),frequency=4)

yrange = range(toysTS)

#set up the plot
plot(c(1992,1995), yrange, type="n",xlab="Year", ylab="Revenue(millions)", bty="l", xaxt="n",yaxt="n")

#Add the time series air
lines(toysTS, bty="l")

#add the x-axis
axis(1, at=seq(1992,1995,1), labels=format(seq(1992,1995,1)))

#add the y-axis
axis(2, at=seq(1000,4500,500), labels=format(seq(1,4.5,.5)), las=2)

3.a - Fit a regression model with a linear trend and seasonal dummy variables. Use the entire series (excluding the last two quarters ) as the training period

ToysTrain<-window(toysTS,1992,c(1995,2))
ToysTrain

##      Qtr1 Qtr2 Qtr3 Qtr4
## 1992 1026 1056 1182 2861
## 1993 1172 1249 1346 3402
## 1994 1286 1317 1449 3893
## 1995 1462 1452

toysLinearSeason <- tslm(ToysTrain~trend+season)
summary(toysLinearSeason)

## 
## Call:
## tslm(formula = ToysTrain ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -335.90  -54.29   18.50   63.80  319.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   906.75     115.35   7.861 2.55e-05 ***
## trend          47.11      11.26   4.185  0.00236 ** 
## season2       -15.11     119.66  -0.126  0.90231    
## season3        89.17     128.67   0.693  0.50582    
## season4      2101.73     129.17  16.272 5.55e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 168.5 on 9 degrees of freedom
## Multiple R-squared:  0.9774, Adjusted R-squared:  0.9673 
## F-statistic: 97.18 on 4 and 9 DF,  p-value: 2.129e-07

A partial output of the regression model is shown in TAble 6.6 where season2 is the Quarter 2 dummy). i. Mention two statistics ( and their values) that measure how well this model fits the training period. The t-value of -0126 for the season2 variable is a measure of how many standard deviations our coefficient estimate is far away from zero. As in this case the value is not far away from zero which indicates a relationship does not exist between season2 and the dependent variable. The p-value for trend is 2.55e-05 and being less than 5% means that it is unlikely we will see a relationship between the trend variable and the dependent variable that is due to chance. ii. Mention two statistics (and their values) that measure the predictive accuracy of this model. The MAPE for the training set which has a value of 5.006914% is the mean absolute percentage error. This measure gives a percentage score of how forecasta deviate (on average) from the actual values. The MAE for the training set which has a value of 92.53061 is the Mean absolute error/deviation and gives the magnitude of the average absolute error. iii.After adjusting for trend, what is the average difference between sales in Q3 and sales in Q1? Quarter 3 has a coefficient of 89.17 which is the difference between it sales and the base sales which is Quarter1.

After adjusting for seasonality, which quarter (Q1,Q2,Q3 or Q4) has the highest average sales? Quarter 4 with a coefficient of 2101.74=3 has the highest average sales.

4.Forecasting Department Store Sales a. In order to fit a regression-based model that accounts for an exponential trend, a logarithm of sales must be performed. b.Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period.

First we import plot and summarize the data.

#first lets import the data, plot and run a summary
library(forecast)
library(ggplot2)
setwd("~/USM_MBA_678")
DeptSales <- read.csv("DeptStoreSales.csv", header=TRUE, stringsAsFactors=FALSE)
str(DeptSales)

## 'data.frame':    24 obs. of  2 variables:
##  $ Quarter: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Sales  : int  50147 49325 57048 76781 48617 50898 58517 77691 50862 53028 ...

head(DeptSales)

##   Quarter Sales
## 1       1 50147
## 2       2 49325
## 3       3 57048
## 4       4 76781
## 5       5 48617
## 6       6 50898

tail(DeptSales)

##    Quarter  Sales
## 19      19  71486
## 20      20  92183
## 21      21  60800
## 22      22  64900
## 23      23  76997
## 24      24 103337

#time series object
DeptSales.ts <- ts(DeptSales$Sales, start=c(1,1), freq=4)
yrange=range(DeptSales.ts)

#autoplotting
autoplot(DeptSales.ts)

Then we partition the series into training and validation data.

validLength<-4
trainLength<-length(DeptSales.ts)-validLength
trainLength

## [1] 20

SalesTrain<-window(DeptSales.ts,end=c(1,trainLength))
SalesValid<-window(DeptSales.ts,start=c(1,trainLength+1))
SalesTrain

##    Qtr1  Qtr2  Qtr3  Qtr4
## 1 50147 49325 57048 76781
## 2 48617 50898 58517 77691
## 3 50862 53028 58849 79660
## 4 51640 54119 65681 85175
## 5 56405 60031 71486 92183

SalesValid

##     Qtr1   Qtr2   Qtr3   Qtr4
## 6  60800  64900  76997 103337

Then we

#call it modelB (as in 5b) and have the dependent variable transformed
modelB<-tslm(log(SalesTrain) ~ trend + season)
summary(modelB)

## 
## Call:
## tslm(formula = log(SalesTrain) ~ trend + season)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.053524 -0.013199 -0.004527  0.014387  0.062681 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.748945   0.018725 574.057  < 2e-16 ***
## trend        0.011088   0.001295   8.561 3.70e-07 ***
## season2      0.024956   0.020764   1.202    0.248    
## season3      0.165343   0.020884   7.917 9.79e-07 ***
## season4      0.433746   0.021084  20.572 2.10e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared:  0.9791, Adjusted R-squared:  0.9736 
## F-statistic: 175.9 on 4 and 15 DF,  p-value: 2.082e-12

A partial output is shown in Table 6.7. From the output, after adjusting for trend, are Q2 average sales higher, lower or approximately equal to the average Q1 sales? Well, even with the log transform of sales, the coefficient is quite small, but it is positive so the Q2 average sales are close to the Q1 sales only slightly higher.
e this model to forecast sales in Quarter 21 and 22.

# use the named coefficients
q21Forecast <- modelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*21

q22Forecast <- modelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*22 + modelB$coefficients["season2"]

#convert them back and simpy echo them
exp(q21Forecast)

## (Intercept) 
##    58793.71

exp(q22Forecast)

## (Intercept) 
##    60951.51

Recreate the fit and forecast models graphically.

#Helps set up the plot
yrange = range(DeptSales.ts)

#set up the plot

plot(c(1,6),c(yrange[1],yrange[2]),type="n",xlab="Year", ylab="Department Sales", bty="l",xaxt="n",yaxt="n")

#add the time series air
lines(DeptSales.ts, bty="l")

#add the x-axis
axis(1, at=seq(1,6,1), labels=format(seq(1,6,1)))
#add the y-axis
axis(2, at=seq(45000,105000,5000), labels=format(seq(45,105,5.0)), las=2)

#add the y-axis

lines(exp(modelB$fitted), col="red")

And the forecast error plot…

forecasterror<-DeptSales.ts-exp(modelB$fitted)
autoplot(forecasterror)

It seems that the forecast errors from midyear of year 4 are always positive and therefore we can expect under-forecasting of Qauters 21 and 22.

given the low value of the forecast errors when compared to the actual sales value, I would say that the regression model fits the data well.
Which of the following solutions is adequate and a aparsimonious solution for improving model fit? You could fit a quadratic trend model to Sales but with an adjusted R squared of already 97+% can we really say that fitting a quadratic trend model is parsimonious. To gain at most a 1 to 2 % increase in the adjusted R squared, aren’t we sacrificing model simplicity?

Problem 5. Souvenir Sales

trend and seasonality
regression model First we import plot and summarize the data.

#first lets import the data, plot and run a summary
library(forecast)
library(ggplot2)
setwd("~/USM_MBA_678")
SouvSales <- read.csv("SouvenirSales.csv", header=TRUE, stringsAsFactors=FALSE)
str(SouvSales)

## 'data.frame':    84 obs. of  2 variables:
##  $ Date : chr  "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
##  $ Sales: num  1665 2398 2841 3547 3753 ...

head(SouvSales)

##     Date   Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74

tail(SouvSales)

##      Date     Sales
## 79 Jul-01  26155.15
## 80 Aug-01  28586.52
## 81 Sep-01  30505.41
## 82 Oct-01  30821.33
## 83 Nov-01  46634.38
## 84 Dec-01 104660.67

#time series object
SouvSales.ts <- ts(SouvSales$Sales, start=c(1,1), freq=12)
yrange=range(SouvSales.ts)

#autoplotting
autoplot(SouvSales.ts)

Then we partition the series into training and validation data.

validSouvLength<-12
trainSouvLength<-length(SouvSales.ts)-validSouvLength
trainSouvLength

## [1] 72

SouvTrain<-window(SouvSales.ts,end=c(1,trainSouvLength))
SouvValid<-window(SouvSales.ts,start=c(1,trainSouvLength+1))
SouvTrain

##        Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1  1664.81  2397.53  2840.71  3547.29  3752.96  3714.74  4349.61  3566.34
## 2  2499.81  5198.24  7225.14  4806.03  5900.88  4951.34  6179.12  4752.15
## 3  4717.02  5702.63  9957.58  5304.78  6492.43  6630.80  7349.62  8176.62
## 4  5921.10  5814.58 12421.25  6369.77  7609.12  7224.75  8121.22  7979.25
## 5  4826.64  6470.23  9638.77  8821.17  8722.37 10209.48 11276.55 12552.22
## 6  7615.03  9849.69 14558.40 11587.33  9332.56 13082.09 16732.78 19888.61
##        Sep      Oct      Nov      Dec
## 1  5021.82  6423.48  7600.60 19756.21
## 2  5496.43  5835.10 12600.08 28541.72
## 3  8573.17  9690.50 15151.84 34061.01
## 4  8093.06  8476.70 17914.66 30114.41
## 5 11637.39 13606.89 21822.11 45060.69
## 6 23933.38 25391.35 36024.80 80721.71

SouvValid

##         Jan       Feb       Mar       Apr       May       Jun       Jul
## 7  10243.24  11266.88  21826.84  17357.33  15997.79  18601.53  26155.15
##         Aug       Sep       Oct       Nov       Dec
## 7  28586.52  30505.41  30821.33  46634.38 104660.67

#Linear trend with seasonality
SouvLinearSeason <- tslm(SouvTrain~trend+season)
summary(SouvLinearSeason)

## 
## Call:
## tslm(formula = SouvTrain ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16

b.i. December appears to have the highest Souvenir sales. This is reasonable if December is the principal vacation month or has holidays where travel to Australia is likely. b.ii The trend coefficient of model A is that the coefficient represent the average sales difference between that month and January.

c Run a log(Sales) model

#Linear trend with seasonality
SouvModelB <- tslm(log(SouvTrain)~trend+season)
summary(SouvModelB)

## 
## Call:
## tslm(formula = log(SouvTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16

c.i. fitting a model to log(Sales) with a linear trend is equivalent to fitting a model to Sales with an exponential trend.

c.ii. That there is a .0211 increase in the log(sales) for each month calculated from the base. c.iii. Sales in February 2002 are 5568.

# use the named coefficients
janForecast <- SouvModelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*85

febForecast <- SouvModelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*86 + modelB$coefficients["season2"]

#convert them back and simpy echo them
exp(janForecast)

## (Intercept) 
##    5371.333

exp(febForecast)

## (Intercept) 
##    5568.468

Of the two models, model B has an adjusted R squared of 93.06 and therefore explains more of the variance than Model A. In addition, only three of the variables are statistically significant in the first model while all variables are useable in Model B

e)Whatever these components are - which is not clear from the question, I would separate the different components into separate datasets and try to predict sales accordingly. In other words, if the components are different types of products, then I would create multiple models to better understand the trends between those years for the particular product. I could also create dummy variables for the different products which would take into account when the products became discontinued.

Week 8 Regression Models

Thomas Beretich

March 26, 2018