March 27, 2023
library(TSA)
library(forecast)
We analyse the assignment1Data2023.csv data, which represents the return (in AUD100,000) of a share market trader’s investment portfolio. Our aims are to:
1. Find the best fitting model among the linear, quadratic, cosine, cyclical, and seasonal trend models, or their combinations, by implementing the model-building strategy.
2. Give the predictions for the next 15 trading days using the best model found.
To do this, we will follow a multistage model-building strategy with the following three steps:
In model specification (or identification), the classes of time series models that may be appropriate for a given observed series are selected. Figure 1 suggests our time series could be linear, quadratic, seasonal, harmonic, or, more likely, a combination of these models. We will adhere to the principle of parsimony; that is, choose a model that requires the minimum number of parameters to represent our ‘returns’ time series.
Model fitting consists of finding the best possible estimates of the parameters involved in the model. We will use the least squares statistical estimation technique.
Model diagnostics will be performed to check the model’s fit and whether the modelling assumptions are satisfied.
Let’s begin this strategy with the most basic model, the linear model.
The deterministic linear trend model is expressed as follows: \(\mu_t = \beta_0 + \beta_1 t\), where \(\beta_0\) represents the intercept and \(\beta_1\) corresponds to the slope of the linear trend.
We test the linear model for our ‘returns’ time series, as we see a downward trend. Let’s fit the linear model and examine it. The linear model for the ‘returns’ time series is:
# Generate Linear model using lm() Least squares method
model1 = lm(returns~time(returns)) # label the model as model1
# Display model summary
summary(model1)
##
## Call:
## lm(formula = returns ~ time(returns))
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.053 -17.959 2.056 15.095 72.799
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 173.38533 3.85513 44.98 <2e-16 ***
## time(returns) -1.59425 0.05227 -30.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.59 on 125 degrees of freedom
## Multiple R-squared: 0.8816, Adjusted R-squared: 0.8806
## F-statistic: 930.3 on 1 and 125 DF, p-value: < 2.2e-16
Here, the intercept (173.39) and slope (-1.59) coefficients are both significant at the 5% significance level (p < 0.05). The model’s overall p-value is also < 0.05, implying the linear model is significant.
Let’s impose the linear model’s trend line over the time series to visualise the model’s fit:
# Plot linear model over time series
plot(returns,type='o',ylab='Return (in AUD100,000) of a share', xlab = 'Trading days', main = "Figure 3: Fitted linear estimates on 'returns' time series")
abline(model1)
Other than capturing the downward trend in the time series, the linear model is clearly a poor fit for our data, even though the \(R^2\) value is 88.16%. We cannot rely on \(R^2\) alone for goodness of fit.
If the trend model is reasonably correct, then the residuals should behave roughly like the true stochastic component, and various assumptions about the stochastic component can be assessed by looking at the residuals. Hence, let’s perform residual analysis for the linear model.
The \(estimator\) or \(predictor\) of the unobserved stochastic component \(\{X_t\}\), \(\hat{X}_t = Y_t - \hat{\mu}_t\), is called the residual corresponding to the \(t^{th}\) observation.
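As a quick sanity check (a sketch, not part of the required output), the raw residuals returned by lm() should match \(Y_t - \hat{\mu}_t\) computed by hand:
# Raw residuals = observed series minus fitted trend; expect TRUE
raw.res = as.vector(returns) - fitted(model1)
all.equal(as.vector(raw.res), as.vector(residuals(model1)))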
If the stochastic component \(\{X_t\}\) is white noise, that is, a process with zero mean and constant variance, then the residuals should behave roughly like independent (normal) random variables with zero mean and standard deviation \(s\). From Figure 1, our time series does not display constant variance, so its stochastic component is unlikely to be a white noise process.
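For reference, a brief simulated illustration (an arbitrary seed is assumed; this is not assignment data) of what the ACF of true white noise looks like, so the residual ACF plots below have a baseline for comparison:
set.seed(42) # arbitrary seed for a reproducible illustration
wn = rnorm(length(returns)) # simulated white noise of the same length as 'returns'
acf(wn, main = "ACF of simulated white noise") # bars should stay within the limits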
Let’s compute the standardized residuals for the linear model and then examine various residual plots. Residuals are calculated by applying rstudent() to the model.
res.model1 = rstudent(model1) # Store residuals in res.model1
# Create partitions for display
par(mfrow=c(2,2))
# Plot residuals vs time (trading days) to check randomness of the residuals
plot(y = res.model1, x = as.vector(time(returns)),xlab = 'Time', ylab='Standardized Residuals',type='l',main = "Standardised residuals trend from linear model")
# Plot histogram (distribution) of residuals to visualize normality
hist(res.model1,xlab='Standardized Residuals', main = "Histogram of standardized residuals")
# Plot Quantile-Quantile to visualize normality
qqnorm(y=res.model1, main = "QQ plot of standardized residuals")
qqline(y=res.model1, col = 2, lwd = 1, lty = 2)
# Plot ACF to check correlation coverage by fitted model
acf(res.model1, main = "ACF of standardized residuals")
Let’s examine each plot one by one for the linear model.
Standardized residuals trend
We notice patterns in the plot, which is not good. Ideally, for a well-fitted model, we expect no discernible trends whatsoever. The residuals clearly display seasonality over time, meaning our linear model does not capture the seasonality of the time series. There is also a gradual increase in range (vertical spread) as the trading days increase; that is, the variance increases over time.
On the other hand, the residuals do not reflect the downward trend seen in the time series, meaning the linear model captures it to some extent (though we still notice an s-shaped/wave-shaped curve).
The residuals trend plot tells us that the linear model is not a good fit for our series.
Histogram of standardized residuals
The histogram of standardized residuals has more negative values (residual value < 0) than positive values. This suggests the linear model’s estimates are generally higher than the observed values (in Figure 3, most of the series lies below the fitted line); hence the predominance of negative residuals.
Ideally, we want the residuals to display normality. The linear model’s residuals look roughly normal, but the model fails the fitness check (from the standardized residuals trend).
QQ plot of standardized residuals
We want the data points to lie on the dotted line for normality. Clearly, both tails are off. From the Shapiro-Wilk normality test below, since p = 0.01269 < 0.05, we reject the null hypothesis that the data are normal.
# To determine normality statistically
shapiro.test(res.model1)
##
## Shapiro-Wilk normality test
##
## data: res.model1
## W = 0.97322, p-value = 0.01269
ACF of standardized residuals
An ACF (auto-correlation function) plot shows the correlation between share returns at different time lags (in our case, days), and tells us a lot about both the residuals and the time series. The blue dashed lines mark the confidence limits for the correlation: autocorrelations beyond these lines are statistically significantly different from zero. If a model fits a series well, then all the bars should lie within the confidence limits.
Since all the bars exceed the confidence limits, the linear model has failed to capture the autocorrelation structure of the series. According to the ACF plot, none of the hypotheses \(\rho_k = 0\) can be accepted (indicating the residuals are not white noise). The ACF plot clearly suggests that the linear model does not fit our data.
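To back this visual reading with a statistic, one option (a sketch; the choice of 10 lags is an assumption) is the Ljung-Box test, whose null hypothesis is that the residual autocorrelations up to the chosen lag are jointly zero:
# Ljung-Box test on the linear model's residuals; a small p-value
# rejects the null of no residual autocorrelation
Box.test(res.model1, lag = 10, type = "Ljung-Box")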
The deterministic quadratic trend model is expressed as follows: \(\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2\), where \(\beta_0\) represents the intercept, \(\beta_1\) corresponds to the linear trend, and \(\beta_2\) corresponds to the quadratic trend in time.
Since the linear model does not provide a good fit, we test whether a quadratic model fits our ‘returns’ series better. Let’s fit the quadratic model to the returns series.
# load t as time
t = time(returns)
# t2 as time squared
t2 = t^2
# Generate Quadratic model using lm() Least squares method
model2 = lm(returns~ t + t2) # label the quadratic trend model as model2
# Check model summary
summary(model2)
##
## Call:
## lm(formula = returns ~ t + t2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.345 -18.387 0.941 15.584 68.478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 177.915494 5.838752 30.471 < 2e-16 ***
## t -1.804958 0.210586 -8.571 3.43e-14 ***
## t2 0.001646 0.001594 1.033 0.304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.59 on 124 degrees of freedom
## Multiple R-squared: 0.8826, Adjusted R-squared: 0.8807
## F-statistic: 465.9 on 2 and 124 DF, p-value: < 2.2e-16
According to the p-values, the quadratic trend term is insignificant (p = 0.304 > 0.05). Since the quadratic term is insignificant, the quadratic model is nearly the same as the linear model for our time series. The R-squared of 88.26% tells the same story (for the linear model it was 88.16%). Thus, picking either one works; but let’s examine more closely by performing the residual analysis.
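As a supplementary check (a sketch; not required by the assignment), an information-criterion comparison can also arbitrate between the two nested models:
# Lower AIC is preferred; with an insignificant quadratic term,
# the two models should score almost identically
AIC(model1, model2)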
Next, let’s impose the quadratic model’s trend line over the time series for visualisation:
# Plot Quadratic model over time series
plot(ts(fitted(model2)), ylim = c(min(c(fitted(model2), as.vector(returns))), max(c(fitted(model2),as.vector(returns)))),
ylab='Share Returns', xlab = 'Trading days', main = "Figure 5: Fitted quadratic estimates on 'returns' time series", type="l")
lines(as.vector(returns),type="o")
Moving on to the residual analysis.
The residuals from the quadratic model are:
res.model2 = rstudent(model2)
# Create partitions for display
par(mfrow=c(2,2))
# Plot residuals vs time (trading days) to check randomness of the residuals
plot(y = res.model2, x = as.vector(time(returns)),xlab = 'Time', ylab='Standardized Residuals',type='l',main = "Standardised residuals from Quadratic model")
# Plot histogram (distribution) of residuals to visualize normality
hist(res.model2,xlab='Standardized Residuals', main = "Histogram of standardised residuals")
# Plot Quantile-Quantile to visualize normality
qqnorm(y=res.model2, main = "QQ plot of standardised residuals")
qqline(y=res.model2, col = 2, lwd = 1, lty = 2)
# Plot ACF to check correlation coverage by fitted model
acf(res.model2, main = "ACF of standardized residuals")
Let’s break down each plot one by one for the quadratic model.
Standardized residuals trend
The residual trends are identical to those seen for the linear model. This is expected, as the quadratic term is insignificant.
Histogram of standardized residuals
The residual histogram of the quadratic model is almost the same as the linear model’s. On closer inspection, the quadratic model’s histogram is slightly better, as it covers residual values greater than 1 more fully (comparing standardized residuals between 2 and 3 for the two models).
QQ plot of standardized residuals
The QQ plot for the quadratic model is almost the same as that for the linear model. From the Shapiro-Wilk normality test, since p = 0.01764 < 0.05, we reject the null hypothesis that the data are normal.
shapiro.test(res.model2)
##
## Shapiro-Wilk normality test
##
## data: res.model2
## W = 0.97472, p-value = 0.01764
ACF of standardized residuals
The ACF of the quadratic model is very slightly better than the linear model’s (the bars sit marginally further inside the confidence limits), but the conclusion is the same: all the bars exceed the confidence limits, meaning the quadratic model fails to capture the autocorrelation in the residuals, the stochastic component of the series is not white noise, and the quadratic model does not fit our data.
Since the quadratic model performs very slightly better than the linear model, we choose the quadratic model over the linear model. Of course, the quadratic model on its own is not enough. We look further to fit the seasonal aspect of the time series, examining the seasonal and harmonic models next.
The deterministic seasonal trend model is expressed as follows: \(Y_t = \mu_t + X_t\), where \(E(X_t) = 0\) for all \(t\), and \(\mu_t\) is defined by the seasonal parameters as
\(\mu_t = \beta_1\) for season 1, \(\beta_2\) for season 2, \(\beta_3\) for season 3, …
Since our time series clearly displays seasonality, let’s fit a seasonal model to the ‘returns’ series.
To fit either a seasonal or a harmonic model, the frequency of the time series cannot be 1, yet ‘returns’ currently has a frequency of 1:
returns
## Time Series:
## Start = 1
## End = 127
## Frequency = 1
## [1] 150.80305807 152.21215799 151.49405248 150.68153801 151.01548379
## [6] 151.05896873 152.24529300 154.32836585 154.99211320 151.47319862
## [11] 148.65887120 149.42475661 149.76723188 152.79419244 156.32484804
## [16] 155.47711985 148.64751873 144.28974550 145.15092082 145.12823085
## [21] 150.98001094 155.18967939 152.31702307 141.41912041 135.88269351
## [26] 137.35092521 136.38396383 145.79343214 151.98285080 146.45194985
## [31] 130.45934365 122.22024827 123.96194926 123.76340190 137.00970421
## [36] 143.94322947 135.92688865 117.26486197 108.57562583 109.74256049
## [41] 110.60954367 129.01589743 135.49843374 124.31976202 104.65195279
## [46] 95.26900949 94.29487663 94.37477787 117.53027695 123.77761512
## [51] 109.88510644 88.11615162 78.25287150 76.08756434 75.57225471
## [56] 102.50237665 108.08497244 92.67132160 69.40893637 58.84454029
## [61] 56.13468662 56.91251923 87.40934899 93.06782601 76.51780460
## [66] 52.35868054 41.32878155 38.34250451 40.69185825 74.43156392
## [71] 81.05790547 63.98375651 38.45748121 27.31432454 24.78323409
## [76] 29.52859185 64.81749992 71.96911598 54.21016437 26.14338285
## [81] 14.02972048 10.95491723 17.52416632 52.46378082 59.97072454
## [86] 43.17610325 12.84834639 0.01752939 -4.03357918 3.83155944
## [91] 38.65303227 47.93226560 32.04728507 0.78978042 -12.15661705
## [96] -16.54803490 -6.29142400 28.71521115 39.03123618 22.97984606
## [101] -10.09253006 -23.88937164 -28.87604996 -15.81700422 21.48876386
## [106] 33.26822703 17.48928106 -16.72770955 -32.43157779 -38.77786096
## [111] -21.00997026 18.81868817 32.96428210 16.06261533 -19.34561245
## [116] -36.37031978 -44.01729753 -20.96522767 21.14554925 37.57495177
## [121] 19.41408315 -16.29602394 -35.55475699 -44.62194939 -17.30734816
## [126] 25.25656020 43.71448310
We therefore need to find the true frequency of our ts object (returns). We can read the frequency off the ACF plot of our model’s residuals:
acf(res.model2, main = "ACF of standardized residuals for Quadratic model")
Counting the lags between successive peaks, we find the frequency to be 7.
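This reading can be cross-checked with a periodogram (a sketch using TSA’s periodogram(); a dominant peak near frequency 1/7 ≈ 0.143 would support a 7-day cycle):
# Periodogram of the detrended (quadratic-model) residuals; a spike
# near frequency 1/7 supports a seasonal period of 7 trading days
periodogram(res.model2, main = "Periodogram of quadratic-model residuals")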
Let’s recreate our ts object, now with a frequency of 7 instead of 1:
returns_f7 <- ts(Rawdata$x, frequency=7) # rebuild the series with a 7-day seasonal period
Let’s get the seasons, in our case the weekdays:
weekdays.= season(returns_f7) # Weekdays added as indicators
weekdays.
## [1] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [8] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [15] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [22] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [29] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [36] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [43] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [50] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [57] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [64] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [71] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [78] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [85] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [92] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [99] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [106] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [113] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [120] Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## [127] Monday
## Levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
returns_f7
## Time Series:
## Start = c(1, 1)
## End = c(19, 1)
## Frequency = 7
## [1] 150.80305807 152.21215799 151.49405248 150.68153801 151.01548379
## [6] 151.05896873 152.24529300 154.32836585 154.99211320 151.47319862
## [11] 148.65887120 149.42475661 149.76723188 152.79419244 156.32484804
## [16] 155.47711985 148.64751873 144.28974550 145.15092082 145.12823085
## [21] 150.98001094 155.18967939 152.31702307 141.41912041 135.88269351
## [26] 137.35092521 136.38396383 145.79343214 151.98285080 146.45194985
## [31] 130.45934365 122.22024827 123.96194926 123.76340190 137.00970421
## [36] 143.94322947 135.92688865 117.26486197 108.57562583 109.74256049
## [41] 110.60954367 129.01589743 135.49843374 124.31976202 104.65195279
## [46] 95.26900949 94.29487663 94.37477787 117.53027695 123.77761512
## [51] 109.88510644 88.11615162 78.25287150 76.08756434 75.57225471
## [56] 102.50237665 108.08497244 92.67132160 69.40893637 58.84454029
## [61] 56.13468662 56.91251923 87.40934899 93.06782601 76.51780460
## [66] 52.35868054 41.32878155 38.34250451 40.69185825 74.43156392
## [71] 81.05790547 63.98375651 38.45748121 27.31432454 24.78323409
## [76] 29.52859185 64.81749992 71.96911598 54.21016437 26.14338285
## [81] 14.02972048 10.95491723 17.52416632 52.46378082 59.97072454
## [86] 43.17610325 12.84834639 0.01752939 -4.03357918 3.83155944
## [91] 38.65303227 47.93226560 32.04728507 0.78978042 -12.15661705
## [96] -16.54803490 -6.29142400 28.71521115 39.03123618 22.97984606
## [101] -10.09253006 -23.88937164 -28.87604996 -15.81700422 21.48876386
## [106] 33.26822703 17.48928106 -16.72770955 -32.43157779 -38.77786096
## [111] -21.00997026 18.81868817 32.96428210 16.06261533 -19.34561245
## [116] -36.37031978 -44.01729753 -20.96522767 21.14554925 37.57495177
## [121] 19.41408315 -16.29602394 -35.55475699 -44.62194939 -17.30734816
## [126] 25.25656020 43.71448310
Notice the frequency has been changed to 7, matching the 7 weekdays.
Now we can fit the seasonal model to our new time series returns_f7:
# -1 removes the intercept term. With an intercept in the model, the resulting
# parameters are interpreted as differences between the first weekday and each
# other weekday; we instead want each weekday's own parameter.
model3 = lm(returns_f7~weekdays.-1) # label the seasonal model as model3
summary(model3)
##
## Call:
## lm(formula = returns_f7 ~ weekdays. - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.865 -56.530 -2.747 58.007 98.773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## weekdays.Monday 95.81 14.17 6.761 5.27e-10 ***
## weekdays.Tuesday 87.23 14.56 5.991 2.24e-08 ***
## weekdays.Wednesday 65.06 14.56 4.469 1.80e-05 ***
## weekdays.Thursday 54.72 14.56 3.758 0.000266 ***
## weekdays.Friday 52.24 14.56 3.588 0.000483 ***
## weekdays.Saturday 58.54 14.56 4.021 0.000102 ***
## weekdays.Sunday 84.50 14.56 5.804 5.39e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.77 on 120 degrees of freedom
## Multiple R-squared: 0.5979, Adjusted R-squared: 0.5744
## F-statistic: 25.49 on 7 and 120 DF, p-value: < 2.2e-16
From the summary statistics, the seasonal model is significant (p-value < 0.05).
The coefficients for each weekday are shown, and all are significant (p < 0.05). Looking at the estimates for each weekday, we observe that Monday has the largest coefficient, meaning that within each week, Mondays have the peak share returns. This makes sense, as share markets see investment pour in on the opening day, Monday. We also observe that the values decline until Friday and then rise again over the weekend. We can confirm this by looking at our time series plot annotated with weekday indicators (notice we index the time axis in weeks rather than days, to highlight the seasonality):
plot(returns_f7,type='l',ylab='Returns', xlab = 'Weeks')
points(y=returns_f7,x=time(returns_f7), pch=as.vector(season(returns_f7)))
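As a quick numeric cross-check (a sketch; with the intercept removed, each coefficient of model3 is exactly the corresponding weekday’s mean return), the raw weekday averages should reproduce the estimates above:
# Mean return per weekday; should match the model3 coefficients
round(tapply(as.vector(returns_f7), weekdays., mean), 2)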
R-squared:
Note the R-squared value is quite small. Why? Let’s overlay the fitted model on the time series to find out:
plot(ts(fitted(model3)), ylab = 'Returns', xlab ='Trading days', main = 'Figure 6:Fitted seasonal estimates on time series of freq 7',
ylim = c(min(c(fitted(model3), as.vector(returns_f7))) ,
max(c(fitted(model3), as.vector(returns_f7)))
), xlim = c(0,127) )
lines(as.vector(returns_f7), type = 'o')
From Figure 6, we understand the loss in R-squared is due to the model fitting only the middle section of our time series data, a consequence of the obvious downward trend in the series.
Note: it fits well only for the trading days around day 60 (the middle portion).
Moving on to the residual analysis to diagnose the seasonal model.
The residuals from the seasonal model are:
res.model3 = rstudent(model3)
# Create partitions for display
par(mfrow=c(2,2))
# Plot residuals vs weeks to check randomness of the residuals
plot(y = res.model3, x = as.vector(time(returns_f7)),xlab = 'Weeks', ylab='Standardized Residuals',type='l',main = "Standardised residuals from seasonal model.")
# Plot histogram (distribution) of residuals to visualize normality
hist(res.model3,xlab='Standardized Residuals', main = "Histogram of standardised residuals.")
# Plot Quantile-Quantile to visualize normality
qqnorm(y=res.model3, main = "QQ plot of standardised residuals.")
qqline(y=res.model3, col = 2, lwd = 1, lty = 2)
# Plot ACF to check correlation coverage by fitted model
acf(res.model3, main = "ACF of standardized residuals.")
Let’s break down each plot one by one for the seasonal model.
Standardized residuals trend
We see that the middle portion (around week 10) shows no patterns, meaning this portion is captured well by the seasonal model. The same cannot be said for the portions to the left and right of week 10: these show a seasonal pattern along with a steady change in range (vertical spread), i.e. a change in variance. Thus, the seasonal model does not capture our time series except for its middle portion. This hints that we need to add another component to the seasonal model to fit our time series.
Histogram of standardized residuals
Ideally, we want the histogram to have a high frequency count at a standardized residual value of zero (that is, to look normal). In our histogram, we see the opposite. This means the seasonal model did not capture either end of the time series properly; hence, the seasonal model alone is not a good fit for our time series.
QQ plot of standardized residuals
From the QQ plot, we see that both tails are off, implying non-normality. This is due to the lower and upper ends of the model not fitting our time series. Since the Shapiro-Wilk p-value is < 0.05, we reject the null hypothesis that the data are normal.
shapiro.test(res.model3)
##
## Shapiro-Wilk normality test
##
## data: res.model3
## W = 0.92612, p-value = 3.159e-06
ACF of standardized residuals
All the significant bars are outside the confidence limits, indicating that the model leaves the correlation structure uncaptured. The tops of the vertical bars form a wave pattern, so, as expected, the seasonal model does capture our time series’s seasonality; apart from that aspect, however, the seasonal model on its own does not reflect our time series.
Thus, we need to add another component to this seasonal model. But does a harmonic model fit our time series better than the seasonal model? Let’s examine.
The harmonic model includes information on the shape of the seasonal trend (missing in the seasonal model) by assigning a cosine curve, \(\beta \cos(2\pi f t + \Phi)\), as the mean function. For fitting, this is re-expressed as
\(\mu_t = \beta_0 + \beta_1 \cos(2\pi f t) + \beta_2 \sin(2\pi f t)\)
Here the constant term \(\beta_0\) can be regarded as a cosine with frequency zero.
Let’s fit a harmonic model to the ‘returns’ series. We can fit it to our frequency-7 time series returns_f7 by first generating the sine and cosine components:
har.=harmonic(returns_f7,1) # Assign sine and cosine components in the har. object
head(har.)
## cos(2*pi*t) sin(2*pi*t)
## [1,] 1.0000000 -2.449213e-16
## [2,] 0.6234898 7.818315e-01
## [3,] -0.2225209 9.749279e-01
## [4,] -0.9009689 4.338837e-01
## [5,] -0.9009689 -4.338837e-01
## [6,] -0.2225209 -9.749279e-01
Now, we can fit the harmonic model as,
model4=lm(returns_f7~ har.)
summary(model4)
##
## Call:
## lm(formula = returns_f7 ~ har.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.243 -56.084 -6.569 59.584 101.395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.175 5.399 13.184 < 2e-16 ***
## har.cos(2*pi*t) 22.608 7.605 2.973 0.00355 **
## har.sin(2*pi*t) 2.731 7.665 0.356 0.72217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.84 on 124 degrees of freedom
## Multiple R-squared: 0.06742, Adjusted R-squared: 0.05238
## F-statistic: 4.482 on 2 and 124 DF, p-value: 0.0132
From the summary stats of the harmonic model, we notice that the cosine coefficient is significant but the sine coefficient is not, so we do not need the sine component. Once the sine component is eliminated, the cosine component conveys only the seasonal aspect of the time series, for which the seasonal model is already sufficient.
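Dropping the insignificant sine term is straightforward if needed (a sketch; model4.cos is a hypothetical refit keeping only the cosine column of har.):
# Refit the harmonic model using only the cosine component
model4.cos = lm(returns_f7 ~ har.[, 1])
summary(model4.cos)$coefficients # the cosine term should remain significant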
Note the R-squared value is very low; let’s plot the harmonic trend line on our time series:
plot(ts(fitted(model4)), ylim = c(min(c(fitted(model4),as.vector(returns_f7))), max(c(fitted(model4),as.vector(returns_f7)))),
ylab='y' , main = "Fitted harmonic estimates on our time series", type="l",lty=2,col="red")
lines(as.vector(returns_f7),type="o")
We notice that the harmonic model fits our time series much like the seasonal model does, but the R-squared for the harmonic model is about 6% (versus 59% for the seasonal model). This \(R^2\) is dragged down by the insignificant sine component. Although \(R^2\) is not a good fitness test, we lean towards the seasonal model as the better fit because it avoids the sine component.
Moving on to the residual analysis.
The residuals from the harmonic model are:
res.model4 = rstudent(model4)
# Create partitions for display
par(mfrow=c(2,2))
# Plot residuals vs weeks to check randomness of the residuals
plot(y = res.model4, x = as.vector(time(returns_f7)),xlab = 'Weeks', ylab='Standardized Residuals',type='l',main = "Standardised residuals from harmonic model.")
# Plot histogram (distribution) of residuals to visualize normality
hist(res.model4,xlab='Standardized Residuals', main = "Histogram of standardised residuals.")
# Plot Quantile-Quantile to visualize normality
qqnorm(y=res.model4, main = "QQ plot of standardised residuals.")
qqline(y=res.model4, col = 2, lwd = 1, lty = 2)
# Plot ACF to check correlation coverage by fitted model
acf(res.model4, main = "ACF of standardized residuals.")
Let’s break down each plot one by one for the harmonic model.
Standardized residuals trend
The plot is identical to that of the seasonal model. The residuals show harmonic patterns except around week 10. If we choose the harmonic model, it alone won’t be enough to fit our time series.
Histogram of standardized residuals
The plot is identical to that of the seasonal model. The histogram suggests a poor fit to the time series.
QQ plot of standardized residuals
The plot is identical to that of the seasonal model. Since p < 0.05, we reject the null hypothesis that the data are normal.
shapiro.test(res.model4)
##
## Shapiro-Wilk normality test
##
## data: res.model4
## W = 0.92615, p-value = 3.17e-06
ACF of standardized residuals
The plot is identical to that of the seasonal model: all the significant bars are outside the confidence limits, indicating poor correlation coverage. The harmonic model on its own does not reflect our time series.
The residual analysis suggests no difference between the harmonic and seasonal models. Hence, we stick with the seasonal model as the better-fitting choice, since it avoids the sine component.
Since none of the basic models fits our time series returns_f7 on its own, we need to combine them to get a good fit. Between linear and quadratic, we found the quadratic model to fit better; between seasonal and harmonic, the seasonal model fits better. Let’s therefore use combinations of the seasonal and quadratic models.
Let’s compare these two combined models side by side and examine the fit visually.
The Seasonal*Quadratic model is given as,
model5 = lm(returns_f7~weekdays.*(t + t2) -1)
The Seasonal+Quadratic model is given as,
model6 = lm(returns_f7~weekdays.+ (t + t2) -1)
# Plot Seasonal*Quadratic model over TS object
plot(ts(fitted(model5)), ylim = c(min(c(fitted(model5),as.vector(returns_f7))), max(c(fitted(model5),as.vector(returns_f7)))),
ylab='y' , main = "Fitted Seasonal*Quadratic curve on our time series", type="l",lty=2,col="red")
lines(as.vector(returns_f7),type="o")
# Plot Seasonal+Quadratic model over TS object
plot(ts(fitted(model6)), ylim = c(min(c(fitted(model6),as.vector(returns_f7))), max(c(fitted(model6),as.vector(returns_f7)))),
ylab='y' , main = "Fitted Seasonal+Quadratic curve on our time series", type="l",lty=2,col="red")
lines(as.vector(returns_f7),type="o")
Clearly, we can see that Seasonal*Quadratic model fits our share returns time series better.
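The visual impression can be checked numerically (a sketch; not part of the required output) by comparing the two fits on AIC:
# Lower AIC is preferred; the interaction model (model5) should win
# if the weekday-specific trends are genuinely needed
AIC(model5, model6)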
Now that model5, Seasonal*Quadratic, is our well-fitting candidate, let’s perform residual analysis for diagnostics.
# Generate residuals for the model
res.model5 = rstudent(model5)
# Create partitions for display
par(mfrow=c(2,2))
# Plot residuals vs weeks to check randomness of the residuals
plot(y = res.model5, x = as.vector(time(returns_f7)),xlab = 'Weeks', ylab='Standardized Residuals',type='l',main = "Standardised residuals from Seasonal*Quadratic model.")
# Plot histogram (distribution) of residuals to visualize normality
hist(res.model5,xlab='Standardized Residuals', main = "Histogram of standardised residuals.")
# Plot Quantile-Quantile to visualize normality
qqnorm(y=res.model5, main = "QQ plot of standardised residuals.")
qqline(y=res.model5, col = 2, lwd = 1, lty = 2)
# Plot ACF to check correlation coverage by fitted model
acf(res.model5, main = "ACF of standardized residuals.")
Let’s break down each plot one by one for the Seasonal*Quadratic model.
Standardized residuals trend
We do not notice seasonality, a downward trend, or an increase in range (change in variance). Thus, the Seasonal*Quadratic model fits our time series better than the other models.
Histogram of standardized residuals
The plot is fairly normal. The histogram of residuals for the Seasonal*Quadratic model depicts normality better than those of the other models.
QQ plot of standardized residuals
The QQ plot shows deviations from normality, and the Shapiro-Wilk p-value is still < 0.05, indicating non-normality. Still, this is the best-fitting model we have.
shapiro.test(res.model5)
##
## Shapiro-Wilk normality test
##
## data: res.model5
## W = 0.96828, p-value = 0.004454
ACF of standardized residuals
The ACF plot shows a few bars inside the confidence limits. However, the fit is not perfect: many bars remain outside the limits, indicating that not all of the correlation is captured by our model.
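As with the earlier models, a Ljung-Box check (a sketch; 10 lags assumed) can quantify the autocorrelation that remains:
# A small p-value indicates autocorrelation remains in the residuals
Box.test(res.model5, lag = 10, type = "Ljung-Box")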
From the available options, Seasonal*Quadratic comes closest to a well-fitting model. Looking back at Figure 1, the downward trend resembles an s-curve rather than a straight or quadratic decline, so an exponential component could perhaps replace or supplement the quadratic component. For now, we stick to these four model classes only.
Forecasting is one of the main reasons regression models are built. Let’s generate forecasts for the next 15 trading days of our time series.
Unfortunately, generating forecasts h steps ahead by hand from our Seasonal*Quadratic model is complex, as there are many coefficients to consider (7 weekday coefficients plus the linear and quadratic slopes and their interactions):
summary(model5)
##
## Call:
## lm(formula = returns_f7 ~ weekdays. * (t + t2) - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.008 -7.299 -2.615 9.638 23.448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## weekdays.Monday 1.730e+02 6.962e+00 24.852 < 2e-16 ***
## weekdays.Tuesday 1.742e+02 7.361e+00 23.670 < 2e-16 ***
## weekdays.Wednesday 1.759e+02 7.601e+00 23.138 < 2e-16 ***
## weekdays.Thursday 1.761e+02 7.847e+00 22.442 < 2e-16 ***
## weekdays.Friday 1.794e+02 8.099e+00 22.146 < 2e-16 ***
## weekdays.Saturday 1.844e+02 8.356e+00 22.071 < 2e-16 ***
## weekdays.Sunday 1.803e+02 8.619e+00 20.918 < 2e-16 ***
## t -1.214e+00 2.521e-01 -4.817 4.88e-06 ***
## t2 8.962e-05 1.903e-03 0.047 0.96252
## weekdays.Tuesday:t -1.558e-01 3.747e-01 -0.416 0.67848
## weekdays.Wednesday:t -6.593e-01 3.779e-01 -1.745 0.08390 .
## weekdays.Thursday:t -8.206e-01 3.810e-01 -2.154 0.03353 *
## weekdays.Friday:t -8.334e-01 3.842e-01 -2.169 0.03231 *
## weekdays.Saturday:t -1.076e+00 3.874e-01 -2.777 0.00649 **
## weekdays.Sunday:t -3.610e-01 3.906e-01 -0.924 0.35747
## weekdays.Tuesday:t2 -6.276e-04 2.894e-03 -0.217 0.82872
## weekdays.Wednesday:t2 1.114e-03 2.894e-03 0.385 0.70106
## weekdays.Thursday:t2 1.373e-03 2.894e-03 0.474 0.63613
## weekdays.Friday:t2 8.158e-04 2.894e-03 0.282 0.77855
## weekdays.Saturday:t2 4.209e-03 2.894e-03 1.455 0.14874
## weekdays.Sunday:t2 1.470e-03 2.894e-03 0.508 0.61246
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.86 on 106 degrees of freedom
## Multiple R-squared: 0.989, Adjusted R-squared: 0.9868
## F-statistic: 454.7 on 21 and 106 DF, p-value: < 2.2e-16
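For completeness, here is a hedged sketch of how predict() could assemble these forecasts directly from model5; the 15 future weekday labels below rest on the assumption that the 7-day cycle simply continues (day 1 is a Monday per season(), so day 128 is a Tuesday):
# Direct forecasts from model5 (a sketch): extend t, t2 and the
# weekday factor 15 trading days beyond the observed series
h = 15
t.new = seq(length(returns_f7) + 1, length.out = h) # days 128..142
wd.new = factor(levels(weekdays.)[(t.new - 1) %% 7 + 1],
                levels = levels(weekdays.))
new.data = data.frame(t = t.new, t2 = t.new^2, weekdays. = wd.new)
predict(model5, newdata = new.data, interval = "prediction")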
Alternatively, the S-ARIMA model sidesteps these calculations entirely. Data Heroes (2021) points out that an S-ARIMA model can be fitted to a seasonal time series.
modelarima = auto.arima(returns_f7,D=1) # auto.arima() from the forecast package; D=1 fixes one order of seasonal differencing
fcst = forecast(modelarima,h=15) # Generate the forecast values
fcst
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 19.14286 24.796772 23.679929 25.913615 23.08871 26.504836
## 19.28571 -11.096396 -12.879497 -9.313294 -13.82341 -8.369377
## 19.42857 -32.383886 -34.536665 -30.231106 -35.67628 -29.091494
## 19.57143 -42.736383 -45.153916 -40.318849 -46.43368 -39.039085
## 19.71429 -11.658271 -14.320649 -8.995894 -15.73003 -7.586517
## 19.85714 31.313850 28.416562 34.211139 26.88283 35.744870
## 20.00000 51.561333 48.445582 54.677085 46.79620 56.326464
## 20.14286 31.974588 27.477838 36.471337 25.09740 38.851773
## 20.28571 -4.080624 -9.938574 1.777326 -13.03959 4.878337
## 20.42857 -27.156686 -33.934465 -20.378907 -37.52241 -16.790968
## 20.57143 -38.642266 -46.141734 -31.142799 -50.11171 -27.172820
## 20.71429 -4.245787 -12.414676 3.923101 -16.73902 8.247450
## 20.85714 39.086220 30.279564 47.892876 25.61760 52.554838
## 21.00000 60.911587 51.508153 70.315021 46.53028 75.292898
## 21.14286 40.734946 29.440416 52.029477 23.46145 58.008441
We are interested in the point forecasts, which give the forecast values of the share returns for trading days 128 to 142. The Lo 95 and Hi 95 columns give the 95% prediction limits of the forecast values under the S-ARIMA model (Lo 80 and Hi 80 give the corresponding 80% limits).
Let’s visualise the forecasts alongside our time series. The blue line shows the point forecasts; the shaded regions cover the 80% and 95% prediction limits.
# Plot forecast values
plot(fcst)
points(y=returns_f7,x=time(returns_f7), pch=as.vector(season(returns_f7)))
Data Heroes (2021) ‘SARIMA - Seasonal ARIMA - Forecasting Model in R’ [video], YouTube, accessed 25 March 2023. https://www.youtube.com/watch?v=fSGn9JTa1B8