ECO 4051 Financial Econometrics

Assignment 3: Forecasting Target Quarterly Revenues for 2nd Quarter 2022

Collins A. Hackman

March 24, 2022

QUESTION 1

target <- read_excel("C:/Users/kofij/Documents/targetrev.xlsx")
mydata<-target %>% dplyr::select(datadate='Data Date', fqtr='Fiscal Quarter', revtq='Revenue - Total')
mydata<-as.data.frame(mydata)
head(mydata, 3)

    datadate fqtr revtq
1 1995-01-31    4  6998
2 1995-04-30    1  4757
3 1995-07-31    2  5236

QUESTION 2

q1 <- mydata %>% plot_ly(x=~datadate, y=~revtq, type='scatter', mode='line', name='Quarterly Rev')
q2 <- mydata %>% plot_ly(x=~datadate, y=~((revtq - lag(revtq))/lag(revtq)), type='scatter', mode='line', name='Diff QR')
q3 <- mydata %>% plot_ly(x=~datadate, y=~((revtq - lag(revtq))/lag(revtq)), type='scatter', mode='line', name='Log Diff QR')
subplot(q1,q2,q3, nrows=3)%>%
  layout(title = list(text = "TARGET CORPORATION REVENUE"),
         plot_bgcolor='#e5ecf6')

Discussion:

Discuss the results in terms of the existence of a trend, the volatility of the variable, and whether there is evidence of seasonality

The revenue plot trends upwards and there’s a regular pattern in its peaks and troughs indicating seasonality in the quarters (except for the outlier in 2020 which can bias the pattern). Based on the differencing and log plots fluctuations, the revenues’ variance over time is not large.

QUESTION 3

mydata<-mydata %>% dplyr::mutate(trend=1:nrow(mydata),
                         trendsq=trend^2,
                         trendcb=trend^3)

QUESTION 3.1 & 3.2

trend.lin<- lm(revtq~trend, mydata)
summary(trend.lin)


Call:
lm(formula = revtq ~ trend, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-4121.6 -1372.9  -565.4  1223.0  7654.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5615.924    461.535   12.17   <2e-16 ***
trend        162.624      7.284   22.33   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2393 on 107 degrees of freedom
Multiple R-squared:  0.8233,    Adjusted R-squared:  0.8216 
F-statistic: 498.5 on 1 and 107 DF,  p-value: < 2.2e-16

trend.quad<-lm(revtq~trend+trendsq, data=mydata)
summary(trend.quad)


Call:
lm(formula = revtq ~ trend + trendsq, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-3981.1 -1284.9  -581.4  1656.0  8163.7 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5077.4425   700.2068   7.251 7.05e-11 ***
trend        191.7313    29.3845   6.525 2.39e-09 ***
trendsq       -0.2646     0.2588  -1.022    0.309    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2392 on 106 degrees of freedom
Multiple R-squared:  0.825, Adjusted R-squared:  0.8217 
F-statistic: 249.9 on 2 and 106 DF,  p-value: < 2.2e-16

trend.cubic<-lm(revtq~trend+trendsq+trendcb, data=mydata)
summary(trend.cubic)


Call:
lm(formula = revtq ~ trend + trendsq + trendcb, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-3785.9 -1432.9  -311.6  1363.2  6988.4 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.765e+03  9.344e+02   4.030 0.000106 ***
trend        3.317e+02  7.323e+01   4.529 1.57e-05 ***
trendsq     -3.431e+00  1.543e+00  -2.223 0.028340 *  
trendcb      1.919e-02  9.224e-03   2.080 0.039926 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2356 on 105 degrees of freedom
Multiple R-squared:  0.8319,    Adjusted R-squared:  0.8271 
F-statistic: 173.3 on 3 and 105 DF,  p-value: < 2.2e-16

$Discussion$ $q3.1$ $and$ $q3.2:$

Looking at the t-values greater than 1.96 critical value and p-values less than 5% significance errors, I can conclude that the trend variable’s estimates are significant and by looking at the R-squared, they account for 82% of the quarterly revenue (high residual here doesn’t mean the model is good since I am ignoring any possible large residuals for now).

$Discussion$ $for$ $q3.3:$

Because of the upward non-linear trend of the revenue series, I prefer the non-linear cubic trend model as it best fits the data points shown by the high r-squared.

fit1<-mydata %>% 
  plot_ly(x = ~datadate) %>% 
  add_trace(y = ~revtq, mode='line', type='scatter', name='Revenue')  %>% 
  add_lines(x = ~datadate, y = fitted(trend.lin), name='Linear fit', mode='line', alpha=1)
fit2<-mydata %>% 
  plot_ly(x = ~datadate) %>% 
  add_trace(y = ~revtq, mode='line', type='scatter', name='Revenue')  %>% 
  add_lines(x = ~datadate, y = fitted(trend.quad), name='Quadratic fit', mode='line', alpha=1)
fit3<-mydata %>% 
  plot_ly(x = ~datadate) %>% 
  add_trace(y = ~revtq, mode='line', type='scatter', name='Revenue') %>% 
  add_lines(x = ~datadate, y = fitted(trend.cubic), name='Cubic fit', mode='line', alpha=1)
subplot(fit1,fit2,fit3, margin=0.03)%>%layout(title='LINEAR, QUADRATIC, & CUBIC FIT ON REVENUE', plot_bgcolor='#e5ecf6')

Discussion q3.4:

Discuss the differences between the linear, quadratic and cubic trend models

In accordance with the OLS assumption: - The quadratic model fits it well but the cubic model fits the peaks better starting from 1995.

The linear model fits well but there’s a considerate gap (error) between the fitted and the observed values from 1995 to 2010. This error can be large as the unit is in millions.

QUESTION 4

trend.cubic.seas <- lm(revtq ~ trend + trendsq + trendcb + factor(fqtr), mydata)
summary(trend.cubic.seas)


Call:
lm(formula = revtq ~ trend + trendsq + trendcb + factor(fqtr), 
    data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-2393.9  -963.0   -28.1   915.8  3970.4 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2278.93191  598.26008   3.809 0.000238 ***
trend          338.61050   43.51015   7.782 6.03e-12 ***
trendsq         -3.47707    0.91680  -3.793 0.000253 ***
trendcb          0.01901    0.00548   3.469 0.000766 ***
factor(fqtr)2  369.70019  380.74009   0.971 0.333843    
factor(fqtr)3  382.22911  380.96299   1.003 0.318079    
factor(fqtr)4 4529.77425  377.48553  12.000  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1399 on 102 degrees of freedom
Multiple R-squared:  0.9424,    Adjusted R-squared:  0.9391 
F-statistic: 278.3 on 6 and 102 DF,  p-value: < 2.2e-16

Discussion for q4:

q4.1 Why is factor(fqtr)1 not among the regressors?

Because the 1st quarter, factor(fqtr)1 would have a coefficient of exactly -/+ 1 indicating perfect correlation with the other dummy variables therefore subjecting the model to the problem of perfect multi-colinearity thus, R automatically drops factor(fqtr)1.

4.2 Discuss the coefficient estimates of the quarterly dummy variables

Looking at the coefficient estimates, The 2nd quarter outgrows the 1st quarter by an average revenue of 370 million and the 3rd quarter outgrows the 2nd quarter by 382 million on average and the 4th quarter outgrows the 3rd quarter by 4 billion.

4.3 Are the coefficients of the trend variables and their significance different from before? Why/Why not?

The coefficients of the trend variables and their significance is different from before because we added more variables which affected the beta1 values as the R-squared jumped from 82% before to 94% after dummy variables were added to the regression (noting that the high R-squared doesn’t equate to the model being good in this case).

QUESTION 5

trend.cubic.ar <- lm(revtq ~ trend + trendsq + trendcb + factor(fqtr) + lag(revtq), mydata)
summary(trend.cubic.ar)


Call:
lm(formula = revtq ~ trend + trendsq + trendcb + factor(fqtr) + 
    lag(revtq), data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max 
-2151.11  -276.75   -25.67   296.46  2304.39 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -3.284e+03  5.486e+02  -5.986 3.38e-08 ***
trend          6.323e+01  3.353e+01   1.886   0.0622 .  
trendsq       -9.819e-01  5.911e-01  -1.661   0.0998 .  
trendcb        6.433e-03  3.459e-03   1.860   0.0659 .  
factor(fqtr)2  4.102e+03  3.501e+02  11.717  < 2e-16 ***
factor(fqtr)3  3.793e+03  3.327e+02  11.399  < 2e-16 ***
factor(fqtr)4  7.929e+03  3.320e+02  23.883  < 2e-16 ***
lag(revtq)     8.528e-01  6.137e-02  13.897  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 825 on 100 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:   0.98, Adjusted R-squared:  0.9786 
F-statistic: 701.2 on 7 and 100 DF,  p-value: < 2.2e-16

Discussion:

The auto-regressive coefficient lag(revtq) being 0.8 is persistent and the trend variables estimates are significant at 10% significance error. In addition, the dummy variables estimates are significant indicating seasonality in the quarters.

RESIDUAL ANALYSIS

qplot(mydata$datadate[-1], residuals(trend.cubic.ar), geom = "line") +
labs(x=NULL,y=NULL) + theme(panel.background = element_rect(fill = "#ADD8E6"))

Discussion:

Question: Plot the residuals (or errors) of the previous model. Discuss their magnitude and whether there are systematic patterns that could be used to improve the model

The unit in millions shows that the residuals can get large meaning the fitted model would miss the observed values by billions of dollars.
We could add an interaction term which in correlation with the trend and dummy variables would fit the seasonal peaks better thus reducing the residuals.
There’s evidence for peak seasonality in the years preceding 2000.
There’s also evidence of underestimation in the periods from 2015 before 2020

Question: If the trend-stationary model is correct the residuals should be independent; how could we evaluate quantitatively the independence of the residuals?

We could calculate the auto-correlation function for the residuals and plot them to see how persistent the correlation is for the deviations.

QUESTION 6

logfit<-lm(formula = log(revtq) ~ trend + trendsq + trendcb + factor(fqtr) +
lag(log(revtq)), data = mydata)

summary(logfit)


Call:
lm(formula = log(revtq) ~ trend + trendsq + trendcb + factor(fqtr) + 
    lag(log(revtq)), data = mydata)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.160579 -0.020345  0.000578  0.023085  0.139139 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1.598e+00  5.305e-01   3.013  0.00328 ** 
trend            8.800e-03  2.612e-03   3.369  0.00107 ** 
trendsq         -1.059e-04  3.683e-05  -2.876  0.00492 ** 
trendcb          5.070e-07  1.982e-07   2.558  0.01203 *  
factor(fqtr)2    2.749e-01  2.254e-02  12.193  < 2e-16 ***
factor(fqtr)3    2.554e-01  2.093e-02  12.201  < 2e-16 ***
factor(fqtr)4    5.238e-01  2.066e-02  25.356  < 2e-16 ***
lag(log(revtq))  7.814e-01  6.112e-02  12.785  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.04434 on 100 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.9901,    Adjusted R-squared:  0.9894 
F-statistic:  1426 on 7 and 100 DF,  p-value: < 2.2e-16

qplot(mydata$datadate[-1], residuals(logfit), geom = "line") +
labs(x=NULL,y=NULL) + theme_dark() + theme(panel.background = element_rect(fill = "#FBCEB1"))

# Evidence for why the residuals are independent
ggAcf(residuals(logfit), 50)+ labs(title="RESIDUAL", subtitle="Log Revenue")+ theme(panel.background = element_rect(fill = "#FBCEB1"))

Discussion:

Question Estimate the model in the previous question but using log(revtq) instead of revtq. Discuss the coefficient estimates. Are the results different? which model do you prefer and why? Are the residuals independent? Provide evidence.

The coefficient estimates are different from the cubic trend.
I prefer the log trend because its significant residual periods are more independent (less persistent) compared to the cubic trend.

QUESTION 7

MANUALLY FORECAST TARGET’s REVENUES

From q4

Q4 model: 2278.932 + 338.611(110) - 3.48(110)^2 + 0.019(110)^3
```
      $23 Billion
```

Even though I am flabbergasted that 19 data analysts have a low revenue estimate of $23.7 Billion even though the average is 24.4 Billion for April 2022 thus, I am not happy with my forecast being short by 700 million.

I have attached a screenshot of the analyst estimate on yahoo below

ADF TESTING

mydata<- ts(mydata)
adf.test(ts(mydata[,"revtq"]), k=4)


    Augmented Dickey-Fuller Test

data:  ts(mydata[, "revtq"])
Dickey-Fuller = -1.6576, Lag order = 4, p-value = 0.7187
alternative hypothesis: stationary

CONCLUSION

Revenue series: Deviations are $not$ $stationary$

DETRENDING

logrevtq <- 100 * diff(log(mydata[,"revtq"]))
summary(ur.df(logrevtq, type="trend", selectlags="AIC"))


############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression trend 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.029 -11.034  -8.297  16.335  38.211 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.18745    3.50444   1.195  0.23490    
z.lag.1     -1.94502    0.16496 -11.791  < 2e-16 ***
tt          -0.02041    0.05586  -0.365  0.71558    
z.diff.lag   0.27384    0.09370   2.922  0.00428 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.59 on 102 degrees of freedom
Multiple R-squared:  0.7798,    Adjusted R-squared:  0.7733 
F-statistic: 120.4 on 3 and 102 DF,  p-value: < 2.2e-16


Value of test-statistic is: -11.791 46.357 69.5145 

Critical values for test statistics: 
      1pct  5pct 10pct
tau3 -3.99 -3.43 -3.13
phi2  6.22  4.75  4.07
phi3  8.43  6.49  5.47

CONCLUSION

log difference of revenue series: deviations around the trend are $stationary$

WRONG FORECAST FIGURES

😅

# Wrong forecasts:
mean(diff(mydata[,"revtq"])) # to get the mule of the ts

[1] 222.2037

sd(diff(mydata[,"revtq"]))/sqrt(length(diff(mydata[,"revtq"])))

[1] 303.2489

m<-222.2037
st<-303.2489
forecast1 <- arima(ts(mydata[,"revtq"]), order=c(1,0,0), method="CSS")
forecast(forecast1, h=1)

    Point Forecast   Lo 80    Hi 80    Lo 95    Hi 95
110       29014.45 25102.3 32926.61 23031.33 34997.58

forecast2<- ar(ts(mydata[,"revtq"]), order.max=1, aic=TRUE, method="ols")
predict(forecast2)

$pred
Time Series:
Start = 110 
End = 110 
Frequency = 1 
[1] 29016.64

$se
Time Series:
Start = 110 
End = 110 
Frequency = 1 
[1] 3052.671

forecast3 <- arima(ts(mydata[,"revtq"]), order=c(0,1,0), method="CSS")
forecast(forecast3, h=1)

    Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
110          30996 26965.92 35026.08 24832.52 37159.48

rwf(ts(mydata[,"revtq"]), h=1, drift=TRUE)

    Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
110        31218.2 27179.45 35256.96 25041.46 37394.94