Introduction

The dataset represents the list of closing price which is also the last price if anyone paid for a share of that stock during business hours of the exchange where the stock trade is done.
Whereas the opening price is the price where the first transaction is set to be the open transaction of any business day.
Usually, ASX uses a Closing Single Price Auction where it generates a price that reflects the interaction of supply demands in the market.

Aim

To use a model-building technique as to examine the given dataset and select the best-fitting model, followed by a prediction for the next 5 observations.
Determining the trends between Deterministic and Stochastic.
Choosing appropriate model to satisfy the deterministic trend.
Fitting the Regression Model to the Data

Method

  1. R software is used for time series analysis on the data.
  2. Time Series Load
  3. Calculating Variance and Correlation techniques
  4. Regression modelling
  5. Forecasting

Data Exploration and Manipulation

Dataset Description

The dataset contains 144 observations out of a total of 252 trading days in a year, with all the data collected in the same year and on consecutive trading days.

Importing packages

For modelling and prediction, we’ll use functions from the TSA package. R functions and datasets includes in the TSA package. The TSA package includes utilities for performing unit root tests, which might be useful for diagnostics and model specification.

## Library Load

rm(list=ls())
library(TSA)
## 
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
## 
##     acf, arima
## The following object is masked from 'package:utils':
## 
##     tar

Reading the data and setting the path

The setwd() method is used to set the working directory, and R’s fetching the csv function is used as to read the data as Stock.csv
The output tells us that the dataset is a dataframe with 144 observations with having 2 variables.
We rename this variable to Stocks to make it clearer.

## Setting Path

getwd()
## [1] "D:/RMIT_Lec/Semester 3/Time Series Analysis/MATH1318 Time Series Analysis - Assignment - 1"
setwd("D:/RMIT_Lec/Semester 3/Time Series Analysis/MATH1318 Time Series Analysis - Assignment - 1")
Stock <- read.csv("Stock.csv")
head(Stock)
##   X  x
## 1 1 80
## 2 2 80
## 3 3 80
## 4 4 80
## 5 5 80
## 6 6 80

The expected datatype for time series analysis is “Time Series” as we don’t have a time series datatype by default, we’ll need to transform the datatype. To convert the data into a time series, we use the ts() method from the TSA package.

Syntax: ts(df$col, frequency=z)

The data is recorded for every business day where the shares of the stock during business hours is calculated. The analysis is done by working on day-to-day frequency report. Thus, the frequency is taken as total number of 144 days, which gives the analysis of data for 144days.

## Load Time Series

Stocks<-ts(Stock$x, frequency = 144)

Analysis of deterministic trend models for model fitting

Using a model-building method, assess the given dataset and determine the best-fitting model among the linear, quadratic, cosine, and seasonal models.
To use the best model to forecast annual changes for the following five years.

Plotting the data

The plot() function is used to plot the time series data, which is then examined for the following 5 valid points: 1. Seasonality 2. Trend 3. Behaviour 4. Intervention 5. Change in Variance

plot(Stocks,type='l',ylab='Average Shares ',xlab = 'Trading Days',main ="A randomly generated time series with deterministic trend.")

plot(y=Stocks,x=zlag(Stocks),ylab='Average Shares', xlab='Trading Days' , main = "Scatter plot for Closing Shares prices")

Analysis of the 5 Valid points

Seasonality: Overall, we can see that the time series plot has a definite repeating pattern. This is due to the mid-section of the plot, which ranges from 1.5 to 2.0, indicating that the model has seasonality. However, there are clues of pattern at the start of the plot, leading us to wonder if this model can display non-seasonality properties. At this stage, we can affirm that the time series is seasonal. As a result, it demonstrates the series’ seasonality features.
Trend: The time series graphic demonstrates the significant declining trend.
Change in Variance: We can say there is a slight change in variance. If you observe carefully there are fluctuations between 1.5 to 1.99 which indicates higher jumps exactly from 1.6 to 2.0. This indicates change in variance in the data.
Intervention Point: Clearly, there is no point of intervention. A change point happens when the plot’s qualities undergo a significant/dramatic change. There is no comparable behaviour in the data over here, indicating that there was no intervention point.

Impact of previous year(s) on the current year

We do the correlation for the specified variables after loading the Time series; for variance, we use Zlag to generate the initial lag of stocks, and for y value, we use Stocks because the data is acquired from the closing prices of shares.

After this index is generated, it assists us in eliminating the first NA values as well as the final five missing values. When it comes to correlation, if the cor value is close to 1, the stated analysis is acceptable to implement.

The Pearson correlation coefficient can be used to determine the impact of prior days on the current day. It’s calculated by multiplying the covariance of the two variables by their standard deviations as a product. The result is normalized and always falls within the range of -1 and 1. The negative correlation between the two is explained by the negative value, while the positive correlation is explained by the positive value.

Let’s start by calculating the pearson’s correlation between today’s Y[t] and yesterday’s Y[t-1], then move on to days Y[t-2], Y[t-3], and so on, until we get a value less than or equal to 1.

This pearson’s correlation is followed by a correlation plot which helps us visually understand the relationship between the two.

First Lag

PEARSON CORRELATION OF Y[t] and Y[t-1]

summary(Stocks)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.47   71.33   81.25   75.41   85.05   95.48
# Data for closing Prices of Shares

y = Stocks

# Generating first lag of stocks

x = zlag(Stocks)

# Creating index to get rid of the first NA values and the last 5 missing values in x

index = 2:length(x)
cor(y[index],x[index])
## [1] 0.9678333

Inference

The correlation of 0.9678333 Y[t] and Y[t-1] appear to be positively connected. As a result, a substantial amount of data can be recovered from previous values (lag 1). By showing the first lag along the x axis and the actual value along the y axis, this correlation can be seen.

CORRELATION PLOT OF Y[t] and Y[t-1]

plot(y=Stocks,x=zlag(Stocks),ylab='Average Shares', xlab='Trading Days' , main = "Scatter plot for showing autocorrelation between Y[t] and Y[t-1]")

Inference

The positive correlation between the two is explained by the fact that as Y[t] increases, so does Y[t-1]. As a result, the past value can be used to retrieve a substantial amount of data (lag 1)

Second Lag

PEARSON CORRELATION OF Y[t] and Y[t-2]

# Data for closing Prices of Shares

y = Stocks

# Generating first lag of stocks

x = zlag(zlag(Stocks))       

# Creating index to get rid of the first NA values and the last 5 missing values in x

index = 3:length(x)
cor(y[index],x[index])
## [1] 0.9198548

Inference

The correlation of 0.9198548 suggest that Y[t] and Y[t-2] are positively correlated but the correlation and it is not stronger. Plotting second lag along the x axis and the actual value along the y axis reveals this correlation.

CORRELATION PLOT OF Y[t] and Y[t-2]

plot(y=Stocks,x=zlag(zlag(Stocks)),ylab='Average Shares', xlab='Trading Days' , main = "Scatter plot for showing autocorrelation between Y[t] and Y[t-2]")

Inference

There is still a positive correlation between Y[t] and Y[t-2]. Higher values of Y[t] gives higher values of Y[t-2] which explains their correlation. However, it is not stronger compared to the first lag.

Third Lag

PEARSON CORRELATION OF Y[t] and Y[t-3]

# Data for closing Prices of Shares

y = Stocks

# Generating first lag of stocks

x = zlag(zlag(zlag(Stocks)))

# Creating index to get rid of the first NA values and the last 5 missing values in x

index = 4:length(x)
cor(y[index],x[index])
## [1] 0.9088598

Inference

The correlation of 0.9088598 suggest that Y[t] and Y[t-3] are positively correlated. This correlation can be understood by plotting second lag along the x axis and the real values along the y axis.

CORRELATION PLOT OF Y[t] and Y[t-3]

plot(y=Stocks,x=zlag(zlag(zlag(Stocks))),ylab='Average Shares', xlab='Trading Days' , main = "Scatter plot for showing autocorrelation between Y[t] and Y[t-3]")

Inference

Not all higher values of Y[t] have higher values as of Y[t-3] which explains correlation is strong.

Model Estimation

Deterministic Trend Models

The model’s qualities are still an unknown to us. One method of estimate is to fit a regression model. To model a non-constant mean trend, regression analysis can be applied.

Regression Trend Models: 1. Linear 2. Quadratic 3. Cosine

Linear trend Model

First, get the time stamps of the time series

# Placing Linear model

Lmodel_1 = lm(Stocks ~ time(Stocks)) # Naming the linear trend model
summary(Lmodel_1)
## 
## Call:
## lm(formula = Stocks ~ time(Stocks))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.083  -8.949   0.231   8.896  23.370 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   132.252      5.316   24.88   <2e-16 ***
## time(Stocks)  -37.983      3.488  -10.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.08 on 142 degrees of freedom
## Multiple R-squared:  0.4551, Adjusted R-squared:  0.4513 
## F-statistic: 118.6 on 1 and 142 DF,  p-value: < 2.2e-16

Modelling

The lm() function is used to calculate the regression model, while the abline() function is used to fit the model.

plot(Stocks,type='o',ylab='Average shares')
# Add the fitted trend line

abline(Lmodel_1)

Summary of the model

  1. The coefficient of t is -37.983, this means there is a 37.983 decrement for each time period.
  2. Pr(>|t|) for t is < 0.05, therefore there is significant information in t and this component can contribute for modelling.
  3. R-squared value for the model is 0.4513. We would like to have R-squared value between 0.75 and 0.85. Over here 0.4513 says that the model is not good enough and it is rejected.
  4. The overall p-value is less than 0.05 which means this trend model is good for fitting.

Residual Analysis of the linear trend model

Time series plot of standardized residuals: A random white noise sequence of standardised residuals is ideal, as it indicates that the fitted trend model has caught the bulk of the observations.

Histogram Plot

The frequency of standardized residuals is represented via a histogram. To firmly support the trend fit, we need a symmetric distribution of standardized residuals, which ideally doesn’t fit.

Q-Q plot

On the QQ plot of normalized residuals, we want to see all the dots should stick to the reference line.

Shapiro-wilk Test

The Shapiro-Wilk test is used to determine whether standardized residuals are normal. Ho: Data are normally distributed HA: Data are not normally distributed If p > 0.05, the null hypothesis is rejected, and we conclude that the data is not normal. If p > 0.05, however, we cannot reject the null hypothesis and must conclude that the data is regularly distributed. As here p value is 0.01413.

ACF & PACF Plot

In an ACF and PACF graphic of standardized residuals, we anticipate observing all bars under the horizontal dashed confidence boundaries. This explains why all the data has been captured and there is no more data to capture in the residuals.

# Residual analysis

res.Lmodel_1 = rstudent(Lmodel_1)
win.graph(width=4.7, height=3,pointsize=8) 
par(mfrow=c(3,2))

# TS Plot

plot(y = res.Lmodel_1, x = as.vector(time(Stocks)),xlab = 'Trading Days', ylab='Average Shares',type='l',main = "Standardised residuals of linear model.")

# Histogram

hist(res.Lmodel_1,xlab='Standardized Residuals', main = "Standardised residuals histogram")

# QQ Plot

qqnorm(y=res.Lmodel_1, main = "Standardised residuals in a QQ plot")
qqline(y=res.Lmodel_1, col = 2, lwd = 1, lty = 2)

# Shapiro Wilk Test

shapiro.test(res.Lmodel_1)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.Lmodel_1
## W = 0.97653, p-value = 0.01413
# ACF Test

acf(res.Lmodel_1, main = "ACF of standardized residuals.")

# PACF Test

pacf(res.Lmodel_1, main = "PACF of standardized residuals.")

par(mfrow=c(1,1))

Summary of residual analysis

A random white noise sequence of standardized residuals is desirable since it suggests that the fitted trend model has correctly predicted most of the data.

A histogram is used to depict the frequency of standardized residuals. A symmetric distribution of standardized residuals, which ideally does not fit, is required to firmly establish the trend fit.

We want to see all the dots on the QQ plot of normalized residuals cling to the reference line.

To evaluate if standardized residuals are normal, the Shapiro-Wilk test is utilized.

The null hypothesis is rejected if p > 0.05, and we conclude that the data are not normal. If p > 0.05, however, the null hypothesis cannot be rejected, and we must infer that the data is normally distributed.

Thus, histogram plot does not support that the trend model captures majority of the observation.

We expect to see all bars beneath the horizontal dashed confidence boundaries in an ACF and PACF picture of standardized residuals. This explains why the residuals have no more data to record because all the data has been captured.

p < 0.05 confirms that the data is not normally distributed.

Quadratic Trend Model

Modelling

Over here, t2 is calculated by squaring t which is used as a parameter in a quadratic model.

# Quadratic model
# Get time points

t = time(Stocks)

# Create t^2

t2 = t^2

# labeling the quadratic trend model as model2

Lmodel_2 = lm(Stocks~ t + t2)
summary(Lmodel_2)
## 
## Call:
## lm(formula = Stocks ~ t + t2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.8176  -3.5322  -0.7553   3.6988  12.5656 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -175.12      13.81  -12.68   <2e-16 ***
## t             388.67      18.93   20.54   <2e-16 ***
## t2           -142.55       6.30  -22.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.634 on 141 degrees of freedom
## Multiple R-squared:  0.8823, Adjusted R-squared:  0.8807 
## F-statistic: 528.6 on 2 and 141 DF,  p-value: < 2.2e-16
plot(ts(fitted(Lmodel_2)), ylim = c(min(c(fitted(Lmodel_2), as.vector(Stocks))), max(c(fitted(Lmodel_2),as.vector(Stocks)))),ylab='Average shares' , main = "Fitted quadratic curve to Stocks", type="l",lty=2,col="blue")

# Add the fitted trend line

lines(as.vector(Stocks),type="o")

Summary of the model

1.The coefficient of t is 388.67, this means there is a 388.67 increment for each time period (t). 2.The coefficient of t2 is -142.55, this means there is a decrements of -142.55 for each time period (t2). 3.Pr(>|t|) for t and t2 is < 0.05, therefore there is significant information in t and t2. 4.R-squared value for the model is 0.8807. Ideally, we should be having R-squared value between 0.75 and 0.85. Over here 0.0.8807 says that the model is good for fitting, and it is best approach as compared to Linear trend model. 5.The overall p-value is less than 0.05 which means this trend model is good for fitting and capable of capturing the information. As, the value is smaller, so it signifies the stronger evidence to reject the Ho.

Residual Analysis of the quadratic trend model

Performing residual analysis on the quadratic trend model.

# Residual analysis

res.Lmodel_2= rstudent(Lmodel_2)
win.graph(width=4.7, height=3,pointsize=8)
par(mfrow=c(3,2))

# TS Plot

plot(y = res.Lmodel_2, x = as.vector(time(Stocks)),xlab = 'Time', ylab='Standardized Residuals',type='l',main = "Standardised residuals from quadratic model.")

# Histogram

hist(res.Lmodel_2,xlab='Standardized Residuals', main = "Histogram of standardised residuals.")

# QQ Plot

qqnorm(y=res.Lmodel_2, main = "QQ plot of standardised residuals.")
qqline(y=res.Lmodel_2, col = 2, lwd = 1, lty = 2)

## Shapiro Wilk Test

shapiro.test(res.Lmodel_2)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.Lmodel_2
## W = 0.98771, p-value = 0.2324
# ACF Test

acf(res.Lmodel_2, main = "ACF of standardized residuals.")

# PACF Test

pacf(res.Lmodel_2, main = "PACF of standardized residuals.")

par(mfrow=c(1,1))

Summary of Residual Analysis of Quadratic trend

This model looks much better than the previous Linear model.

Thus, the time series plot of the standardized residuals support that the trend model captures majority.

The histogram plot of standardized residuals almost shows a symmetric distribution. Thus, it is a right skewed distribution.

Almost all the dots stick to the reference line which suggest hat the model is fitted.

Shapiro-wilk test gives a result of p = 0.2324 and p > 0.05 confirms that the data is normally distributed.

Most of the bars are within the horizontal dashed confidence boundaries in an ACF picture of standardized residuals, which explains have data is recorded and captured simultaneously.

Cosine Trend Model

The harmonics are initially calculated using the harmonic(x,m) function, where “x” denotes the time series and “m” denotes the number of pairs of harmonic functions to be constructed.

The time series data and harmonics are combined to generate a dataframe.

The cosine behavior is then modelled using the regression function lm().

Modeling

Trying to fit the cosine model for the time series data.

# calculate cos(2*pi*t) and sin(2*pi*t)

har. <- harmonic(Stocks, 1)
data <- data.frame(Stocks,har.)
Lmodel_3 <- lm(Stocks ~ cos.2.pi.t. + sin.2.pi.t. , data = data)

summary(Lmodel_3)
## 
## Call:
## lm(formula = Stocks ~ cos.2.pi.t. + sin.2.pi.t., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.116  -5.243   0.342   7.315  19.029 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  75.4093     0.9136  82.538  < 2e-16 ***
## cos.2.pi.t. -14.4388     1.2921 -11.175  < 2e-16 ***
## sin.2.pi.t.   9.1899     1.2921   7.113 5.26e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 141 degrees of freedom
## Multiple R-squared:  0.5545, Adjusted R-squared:  0.5481 
## F-statistic: 87.73 on 2 and 141 DF,  p-value: < 2.2e-16
plot(ts(fitted(Lmodel_3)), ylab='y', main = "Fitted cosine wave to monthly max temp series.",
     ylim = c(min(c(fitted(Lmodel_3), as.vector(Stocks))) ,max(c(fitted(Lmodel_3), as.vector(Stocks)))), col = "red" )

lines(as.vector(Stocks),type="o")

Summary of the Model

  1. The p-values for the coefficients for cosine and sine terms are more than 0.05, indicating that they are not significant and are ineffective at capturing information.
  2. The model’s R-squared value is -0.5545. The R-squared value should ideally be between 0.75 and 0.85. -0.5545 indicates that the model is unsuitable for fitting.
  3. The aggregate p-value is greater than 0.05, implying that this trend model is not significant and unfittable.

Residual Analysis of the cosine trend model

Let’s look at the residuals of the cosine trend model.

res.Lmodel_3= rstudent(Lmodel_3)
win.graph(width=4.7, height=3,pointsize=8)
par(mfrow=c(3,2))

# TS Plot

plot(y = res.Lmodel_3, x = as.vector(time(Stocks)),xlab = 'Time', ylab='Standardized Residuals',type='l',main = "Standardised residuals from quadratic model.")

# Histogram

hist(res.Lmodel_2,xlab='Standardized Residuals', main = "Histogram of standardised residuals.")

# QQ Plot

qqnorm(y=res.Lmodel_3, main = "QQ plot of standardised residuals.")
qqline(y=res.Lmodel_3, col = 2, lwd = 1, lty = 2)

# Shapiro Wilk test

shapiro.test(res.Lmodel_3)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.Lmodel_3
## W = 0.92724, p-value = 1.002e-06
# ACF Plot

acf(res.Lmodel_3, main = "ACF of standardized residuals.")

# PACF Plot

pacf(res.Lmodel_3, main = "PACF of standardized residuals.")

par(mfrow=c(1,1))

Summary of Residual Analysis

The time series plot of the standardized residuals shows no pattern.

The standardized residuals time series plot indicates that the model did not capture the majority of the data. As a result, this trend model will provide a poor fit.

The presence of a symmetric distribution is not indicated by a histogram plot.

The majority of the dots are pointing away from the reference line, indicating that the trend model fit is poor.

The Shapiro-Wilk test yields a p value of 1.002e-06, indicating that the data is not regularly distributed.

There are many lags above the confidence interval in both ACF and PACF. This demonstrates that a significant amount of data is not gathered, indicating that this strategy is ineffective.

Prediction

We may conclude that of the four deterministic models, the quadratic model has the most variety. It also adheres to the rules of normality. As a result, we may conclude that it is the most appropriate fit for our Stocks Data set.

We will now use the Quadratic model to forecast the stock market’s trend for the next five years.

data(Stock)
## Warning in data(Stock): data set 'Stock' not found
# Prediction for the next 5 years using Quadratic model
t = c(115,116,117,118,119)
t2= t^2

# In this step we will use two-step algorithm to predict the data

predict_data = data.frame(t,t2)

# Forecasting of the data

forecast_data = predict(Lmodel_2, predict_data, interval = "prediction")

# Here the interval argument displays the prediction interval printed in "forcasted_data

print(forecast_data)
##        fit      lwr      upr
## 1 -1840647 -2001108 -1680185
## 2 -1873186 -2036488 -1709885
## 3 -1906011 -2072177 -1739844
## 4 -1939120 -2108176 -1770064
## 5 -1972515 -2144486 -1800544
# Plotting the prediction data

plot(Stock, ylim = c(0,120), ylab = "Standardized Residuals", main = "5 years forcasted data for ASX Stocks")

lines(ts(as.vector(forecast_data[,1]), start = 115), col="blue", type="l")
lines(ts(as.vector(forecast_data[,2]), start = 115), col="green", type="l")
lines(ts(as.vector(forecast_data[,3]), start = 115), col="green", type="l")
legend("topright", lty=1, pch=1, col=c("black","green","blue"), text.width = 38,
       c("Data","Forecast limits(5%)", "Actual Forecasts"))

Conclusion

The data is first transformed to a time series and examined for 5 valid points, after which the data’s pearson correlation with its lags is computed.

The ACF-PACF plot explains the series is seasonal nature while also it indicates that it is non-stationary.

Assuming the series is deterministic, we try to fit linear, quadratic,and cosine regression models to it and examine the summary, followed by residual analysis for the four models.

Following the research, it is obvious that the quadratic model provides the best fit and aids in the capture of the majority of data for prediction.

The forecast for the next five years is calculated using a quadratic model. We may conclude from the deterministic trend modelling and residual analysis determines that the model has a weak deterministic trend and follows a stochastic trend.