QUESTION 1
target <- read_excel("C:/Users/kofij/Documents/targetrev.xlsx")
mydata<-target %>% dplyr::select(datadate='Data Date', fqtr='Fiscal Quarter', revtq='Revenue - Total')
mydata<-as.data.frame(mydata)
head(mydata, 3)
datadate fqtr revtq
1 1995-01-31 4 6998
2 1995-04-30 1 4757
3 1995-07-31 2 5236
QUESTION 2
q1 <- mydata %>% plot_ly(x=~datadate, y=~revtq, type='scatter', mode='line', name='Quarterly Rev')
q2 <- mydata %>% plot_ly(x=~datadate, y=~((revtq - lag(revtq))/lag(revtq)), type='scatter', mode='line', name='Diff QR')
q3 <- mydata %>% plot_ly(x=~datadate, y=~((revtq - lag(revtq))/lag(revtq)), type='scatter', mode='line', name='Log Diff QR')
subplot(q1,q2,q3, nrows=3)%>%
layout(title = list(text = "TARGET CORPORATION REVENUE"),
plot_bgcolor='#e5ecf6')
Discussion:
Discuss the results in terms of the existence of a trend, the volatility of the variable, and whether there is evidence of seasonality
- The revenue plot trends upwards and there’s a regular pattern in its peaks and troughs indicating seasonality in the quarters (except for the outlier in 2020 which can bias the pattern). Based on the differencing and log plots fluctuations, the revenues’ variance over time is not large.
QUESTION 3
mydata<-mydata %>% dplyr::mutate(trend=1:nrow(mydata),
trendsq=trend^2,
trendcb=trend^3)
QUESTION 3.1 & 3.2
trend.lin<- lm(revtq~trend, mydata)
summary(trend.lin)
Call:
lm(formula = revtq ~ trend, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-4121.6 -1372.9 -565.4 1223.0 7654.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5615.924 461.535 12.17 <2e-16 ***
trend 162.624 7.284 22.33 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2393 on 107 degrees of freedom
Multiple R-squared: 0.8233, Adjusted R-squared: 0.8216
F-statistic: 498.5 on 1 and 107 DF, p-value: < 2.2e-16
trend.quad<-lm(revtq~trend+trendsq, data=mydata)
summary(trend.quad)
Call:
lm(formula = revtq ~ trend + trendsq, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-3981.1 -1284.9 -581.4 1656.0 8163.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5077.4425 700.2068 7.251 7.05e-11 ***
trend 191.7313 29.3845 6.525 2.39e-09 ***
trendsq -0.2646 0.2588 -1.022 0.309
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2392 on 106 degrees of freedom
Multiple R-squared: 0.825, Adjusted R-squared: 0.8217
F-statistic: 249.9 on 2 and 106 DF, p-value: < 2.2e-16
trend.cubic<-lm(revtq~trend+trendsq+trendcb, data=mydata)
summary(trend.cubic)
Call:
lm(formula = revtq ~ trend + trendsq + trendcb, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-3785.9 -1432.9 -311.6 1363.2 6988.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.765e+03 9.344e+02 4.030 0.000106 ***
trend 3.317e+02 7.323e+01 4.529 1.57e-05 ***
trendsq -3.431e+00 1.543e+00 -2.223 0.028340 *
trendcb 1.919e-02 9.224e-03 2.080 0.039926 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2356 on 105 degrees of freedom
Multiple R-squared: 0.8319, Adjusted R-squared: 0.8271
F-statistic: 173.3 on 3 and 105 DF, p-value: < 2.2e-16
\(Discussion\) \(q3.1\) \(and\) \(q3.2:\)
Looking at the t-values greater than 1.96 critical value and p-values less than 5% significance errors, I can conclude that the trend variable’s estimates are significant and by looking at the R-squared, they account for 82% of the quarterly revenue (high residual here doesn’t mean the model is good since I am ignoring any possible large residuals for now).
\(Discussion\) \(for\) \(q3.3:\)
Because of the upward non-linear trend of the revenue series, I prefer the non-linear cubic trend model as it best fits the data points shown by the high r-squared.
fit1<-mydata %>%
plot_ly(x = ~datadate) %>%
add_trace(y = ~revtq, mode='line', type='scatter', name='Revenue') %>%
add_lines(x = ~datadate, y = fitted(trend.lin), name='Linear fit', mode='line', alpha=1)
fit2<-mydata %>%
plot_ly(x = ~datadate) %>%
add_trace(y = ~revtq, mode='line', type='scatter', name='Revenue') %>%
add_lines(x = ~datadate, y = fitted(trend.quad), name='Quadratic fit', mode='line', alpha=1)
fit3<-mydata %>%
plot_ly(x = ~datadate) %>%
add_trace(y = ~revtq, mode='line', type='scatter', name='Revenue') %>%
add_lines(x = ~datadate, y = fitted(trend.cubic), name='Cubic fit', mode='line', alpha=1)
subplot(fit1,fit2,fit3, margin=0.03)%>%layout(title='LINEAR, QUADRATIC, & CUBIC FIT ON REVENUE', plot_bgcolor='#e5ecf6')
Discussion q3.4:
Discuss the differences between the linear, quadratic and cubic trend models
In accordance with the OLS assumption: - The quadratic model fits it well but the cubic model fits the peaks better starting from 1995.
- The linear model fits well but there’s a considerate gap (error) between the fitted and the observed values from 1995 to 2010. This error can be large as the unit is in millions.
QUESTION 4
trend.cubic.seas <- lm(revtq ~ trend + trendsq + trendcb + factor(fqtr), mydata)
summary(trend.cubic.seas)
Call:
lm(formula = revtq ~ trend + trendsq + trendcb + factor(fqtr),
data = mydata)
Residuals:
Min 1Q Median 3Q Max
-2393.9 -963.0 -28.1 915.8 3970.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2278.93191 598.26008 3.809 0.000238 ***
trend 338.61050 43.51015 7.782 6.03e-12 ***
trendsq -3.47707 0.91680 -3.793 0.000253 ***
trendcb 0.01901 0.00548 3.469 0.000766 ***
factor(fqtr)2 369.70019 380.74009 0.971 0.333843
factor(fqtr)3 382.22911 380.96299 1.003 0.318079
factor(fqtr)4 4529.77425 377.48553 12.000 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1399 on 102 degrees of freedom
Multiple R-squared: 0.9424, Adjusted R-squared: 0.9391
F-statistic: 278.3 on 6 and 102 DF, p-value: < 2.2e-16
Discussion for q4:
q4.1 Why is factor(fqtr)1 not among the regressors?
Because the 1st quarter, factor(fqtr)1 would have a coefficient of exactly -/+ 1 indicating perfect correlation with the other dummy variables therefore subjecting the model to the problem of perfect multi-colinearity thus, R automatically drops factor(fqtr)1.
4.2 Discuss the coefficient estimates of the quarterly dummy variables
Looking at the coefficient estimates, The 2nd quarter outgrows the 1st quarter by an average revenue of 370 million and the 3rd quarter outgrows the 2nd quarter by 382 million on average and the 4th quarter outgrows the 3rd quarter by 4 billion.
4.3 Are the coefficients of the trend variables and their significance different from before? Why/Why not?
The coefficients of the trend variables and their significance is different from before because we added more variables which affected the beta1 values as the R-squared jumped from 82% before to 94% after dummy variables were added to the regression (noting that the high R-squared doesn’t equate to the model being good in this case).
QUESTION 5
trend.cubic.ar <- lm(revtq ~ trend + trendsq + trendcb + factor(fqtr) + lag(revtq), mydata)
summary(trend.cubic.ar)
Call:
lm(formula = revtq ~ trend + trendsq + trendcb + factor(fqtr) +
lag(revtq), data = mydata)
Residuals:
Min 1Q Median 3Q Max
-2151.11 -276.75 -25.67 296.46 2304.39
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.284e+03 5.486e+02 -5.986 3.38e-08 ***
trend 6.323e+01 3.353e+01 1.886 0.0622 .
trendsq -9.819e-01 5.911e-01 -1.661 0.0998 .
trendcb 6.433e-03 3.459e-03 1.860 0.0659 .
factor(fqtr)2 4.102e+03 3.501e+02 11.717 < 2e-16 ***
factor(fqtr)3 3.793e+03 3.327e+02 11.399 < 2e-16 ***
factor(fqtr)4 7.929e+03 3.320e+02 23.883 < 2e-16 ***
lag(revtq) 8.528e-01 6.137e-02 13.897 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 825 on 100 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.98, Adjusted R-squared: 0.9786
F-statistic: 701.2 on 7 and 100 DF, p-value: < 2.2e-16
Discussion:
The auto-regressive coefficient lag(revtq) being 0.8 is persistent and the trend variables estimates are significant at 10% significance error. In addition, the dummy variables estimates are significant indicating seasonality in the quarters.
RESIDUAL ANALYSIS
qplot(mydata$datadate[-1], residuals(trend.cubic.ar), geom = "line") +
labs(x=NULL,y=NULL) + theme(panel.background = element_rect(fill = "#ADD8E6"))
Discussion:
Question: Plot the residuals (or errors) of the previous model. Discuss their magnitude and whether there are systematic patterns that could be used to improve the model
The unit in millions shows that the residuals can get large meaning the fitted model would miss the observed values by billions of dollars.
We could add an interaction term which in correlation with the trend and dummy variables would fit the seasonal peaks better thus reducing the residuals.
There’s evidence for peak seasonality in the years preceding 2000.
There’s also evidence of underestimation in the periods from 2015 before 2020
Question: If the trend-stationary model is correct the residuals should be independent; how could we evaluate quantitatively the independence of the residuals?
We could calculate the auto-correlation function for the residuals and plot them to see how persistent the correlation is for the deviations.
QUESTION 6
logfit<-lm(formula = log(revtq) ~ trend + trendsq + trendcb + factor(fqtr) +
lag(log(revtq)), data = mydata)
summary(logfit)
Call:
lm(formula = log(revtq) ~ trend + trendsq + trendcb + factor(fqtr) +
lag(log(revtq)), data = mydata)
Residuals:
Min 1Q Median 3Q Max
-0.160579 -0.020345 0.000578 0.023085 0.139139
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.598e+00 5.305e-01 3.013 0.00328 **
trend 8.800e-03 2.612e-03 3.369 0.00107 **
trendsq -1.059e-04 3.683e-05 -2.876 0.00492 **
trendcb 5.070e-07 1.982e-07 2.558 0.01203 *
factor(fqtr)2 2.749e-01 2.254e-02 12.193 < 2e-16 ***
factor(fqtr)3 2.554e-01 2.093e-02 12.201 < 2e-16 ***
factor(fqtr)4 5.238e-01 2.066e-02 25.356 < 2e-16 ***
lag(log(revtq)) 7.814e-01 6.112e-02 12.785 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04434 on 100 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.9901, Adjusted R-squared: 0.9894
F-statistic: 1426 on 7 and 100 DF, p-value: < 2.2e-16
qplot(mydata$datadate[-1], residuals(logfit), geom = "line") +
labs(x=NULL,y=NULL) + theme_dark() + theme(panel.background = element_rect(fill = "#FBCEB1"))
# Evidence for why the residuals are independent
ggAcf(residuals(logfit), 50)+ labs(title="RESIDUAL", subtitle="Log Revenue")+ theme(panel.background = element_rect(fill = "#FBCEB1"))
Discussion:
Question Estimate the model in the previous question but using log(revtq) instead of revtq. Discuss the coefficient estimates. Are the results different? which model do you prefer and why? Are the residuals independent? Provide evidence.
- The coefficient estimates are different from the cubic trend.
- I prefer the log trend because its significant residual periods are more independent (less persistent) compared to the cubic trend.
QUESTION 7
MANUALLY FORECAST TARGET’s REVENUES
From q4
Q4 model: 2278.932 + 338.611(110) - 3.48(110)^2 + 0.019(110)^3
$23 Billion
Even though I am flabbergasted that 19 data analysts have a low revenue estimate of $23.7 Billion even though the average is 24.4 Billion for April 2022 thus, I am not happy with my forecast being short by 700 million.
- I have attached a screenshot of the analyst estimate on yahoo below
ADF TESTING
mydata<- ts(mydata)
adf.test(ts(mydata[,"revtq"]), k=4)
Augmented Dickey-Fuller Test
data: ts(mydata[, "revtq"])
Dickey-Fuller = -1.6576, Lag order = 4, p-value = 0.7187
alternative hypothesis: stationary
CONCLUSION
Revenue series: Deviations are \(not\) \(stationary\)
DETRENDING
logrevtq <- 100 * diff(log(mydata[,"revtq"]))
summary(ur.df(logrevtq, type="trend", selectlags="AIC"))
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression trend
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-27.029 -11.034 -8.297 16.335 38.211
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.18745 3.50444 1.195 0.23490
z.lag.1 -1.94502 0.16496 -11.791 < 2e-16 ***
tt -0.02041 0.05586 -0.365 0.71558
z.diff.lag 0.27384 0.09370 2.922 0.00428 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.59 on 102 degrees of freedom
Multiple R-squared: 0.7798, Adjusted R-squared: 0.7733
F-statistic: 120.4 on 3 and 102 DF, p-value: < 2.2e-16
Value of test-statistic is: -11.791 46.357 69.5145
Critical values for test statistics:
1pct 5pct 10pct
tau3 -3.99 -3.43 -3.13
phi2 6.22 4.75 4.07
phi3 8.43 6.49 5.47
CONCLUSION
log difference of revenue series: deviations around the trend are \(stationary\)
WRONG FORECAST FIGURES
😅
# Wrong forecasts:
mean(diff(mydata[,"revtq"])) # to get the mule of the ts
[1] 222.2037
sd(diff(mydata[,"revtq"]))/sqrt(length(diff(mydata[,"revtq"])))
[1] 303.2489
m<-222.2037
st<-303.2489
forecast1 <- arima(ts(mydata[,"revtq"]), order=c(1,0,0), method="CSS")
forecast(forecast1, h=1)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
110 29014.45 25102.3 32926.61 23031.33 34997.58
forecast2<- ar(ts(mydata[,"revtq"]), order.max=1, aic=TRUE, method="ols")
predict(forecast2)
$pred
Time Series:
Start = 110
End = 110
Frequency = 1
[1] 29016.64
$se
Time Series:
Start = 110
End = 110
Frequency = 1
[1] 3052.671
forecast3 <- arima(ts(mydata[,"revtq"]), order=c(0,1,0), method="CSS")
forecast(forecast3, h=1)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
110 30996 26965.92 35026.08 24832.52 37159.48
rwf(ts(mydata[,"revtq"]), h=1, drift=TRUE)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
110 31218.2 27179.45 35256.96 25041.46 37394.94