The dataset is a series of daily closing prices: the closing price is the last price paid for a share of the stock during the business hours of the exchange on which it trades, whereas the opening price is the price of the first transaction of the business day. The ASX typically determines the close through a Closing Single Price Auction, which generates a price reflecting the interaction of supply and demand in the market.
The objective is to apply a model-building technique to the given dataset, select the best-fitting model, and then predict the next 5 observations. This involves determining whether the trend is deterministic or stochastic, choosing an appropriate model for the deterministic trend, and fitting regression models to the data.
The dataset contains 144 observations out of a total of 252 trading days in a year, with all the data collected in the same year and on consecutive trading days.
For modelling and prediction, we use functions and datasets from the TSA package. The TSA package also includes utilities for performing unit root tests, which can be useful for diagnostics and model specification.
## Library Load
rm(list=ls())
library(TSA)
##
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
##
## acf, arima
## The following object is masked from 'package:utils':
##
## tar
The setwd() function sets the working directory, and read.csv() reads the data from Stock.csv. The output tells us that the dataset is a data frame with 144 observations of 2 variables. We later store the series in an object named Stocks for clarity.
## Setting Path
getwd()
## [1] "D:/RMIT_Lec/Semester 3/Time Series Analysis/MATH1318 Time Series Analysis - Assignment - 1"
setwd("D:/RMIT_Lec/Semester 3/Time Series Analysis/MATH1318 Time Series Analysis - Assignment - 1")
Stock <- read.csv("Stock.csv")
head(Stock)
## X x
## 1 1 80
## 2 2 80
## 3 3 80
## 4 4 80
## 5 5 80
## 6 6 80
Time series analysis expects data of class "Time Series" ("ts"). Since read.csv() returns a data frame rather than a time series, the data must be converted, which we do with the ts() function; note that ts() is part of base R rather than the TSA package.
The data are recorded for every trading day on which the stock's shares were traded, so the analysis works at a day-to-day frequency. The frequency is therefore set to 144, the total number of trading days in the dataset.
## Load Time Series
Stocks<-ts(Stock$x, frequency = 144)
Using a model-building method, we assess the given dataset and determine the best-fitting model among the linear, quadratic, cosine, and seasonal models, then use the best model to forecast the next five observations.
The plot() function is used to plot the time series data, which is then examined for the following five features: 1. Seasonality 2. Trend 3. Behaviour 4. Intervention 5. Change in Variance
plot(Stocks,type='l',ylab='Average Shares',xlab = 'Trading Days',main ="Time series plot of ASX closing share prices")
plot(y=Stocks,x=zlag(Stocks),ylab='Average Shares', xlab='First lag of average shares', main = "Scatter plot of closing share prices")
Seasonality: Overall, the time series plot shows what looks like a repeating pattern, particularly in the mid-section of the plot (roughly 1.5 to 2.0 on the time axis), suggesting that the model has seasonality. There are, however, hints of a different pattern at the start of the plot, leaving open the possibility that the series is non-seasonal. At this stage we treat the series as seasonal.

Trend: The time series plot shows a pronounced downward trend.

Change in Variance: There is a slight change in variance: the fluctuations between 1.5 and 1.99 show larger jumps, most clearly from 1.6 to 2.0, indicating a change in variance in the data.

Intervention Point: There is no clear intervention point. A change point occurs when the plot's behaviour undergoes a significant, dramatic shift, and no such behaviour appears in these data.
After loading the time series, we compute the correlation between the series and its lags: zlag() generates the first lag of Stocks for the x values, while the y values are the closing prices themselves. An index is then created to drop the leading NA values that lagging introduces. A correlation value close to 1 indicates a strong linear relationship. The Pearson correlation coefficient measures the influence of previous days on the current day: it is the covariance of the two variables divided by the product of their standard deviations, so it is normalized and always falls between -1 and 1. A negative value indicates negative correlation, while a positive value indicates positive correlation. We start by calculating the Pearson correlation between today's Y[t] and yesterday's Y[t-1], then move on to Y[t-2], Y[t-3], and so on. Each correlation is followed by a scatter plot that helps us visually understand the relationship.
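In symbols, the lag-1 Pearson correlation used below is:

$$r = \frac{\operatorname{Cov}(Y_t,\, Y_{t-1})}{\sigma_{Y_t}\,\sigma_{Y_{t-1}}}$$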
summary(Stocks)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.47 71.33 81.25 75.41 85.05 95.48
# Data for closing Prices of Shares
y = Stocks
# Generating first lag of stocks
x = zlag(Stocks)
# Creating index to drop the leading NA value introduced by zlag
index = 2:length(x)
cor(y[index],x[index])
## [1] 0.9678333
The correlation of 0.9678333 shows that Y[t] and Y[t-1] are strongly positively correlated. As a result, a substantial amount of information about the series can be recovered from the previous value (lag 1). This correlation can be visualized by plotting the first lag along the x axis and the actual value along the y axis.
plot(y=Stocks,x=zlag(Stocks),ylab='Average Shares', xlab='First lag of average shares', main = "Scatter plot showing autocorrelation between Y[t] and Y[t-1]")
The upward pattern in the scatter plot reflects the positive correlation: as Y[t-1] increases, so does Y[t]. As a result, a substantial amount of information can be retrieved from the past value (lag 1).
# Data for closing Prices of Shares
y = Stocks
# Generating second lag of stocks
x = zlag(zlag(Stocks))
# Creating index to drop the two leading NA values introduced by double lagging
index = 3:length(x)
cor(y[index],x[index])
## [1] 0.9198548
The correlation of 0.9198548 suggests that Y[t] and Y[t-2] are positively correlated, though the relationship is weaker than at lag 1. Plotting the second lag along the x axis and the actual value along the y axis reveals this correlation.
plot(y=Stocks,x=zlag(zlag(Stocks)),ylab='Average Shares', xlab='Second lag of average shares', main = "Scatter plot showing autocorrelation between Y[t] and Y[t-2]")
There is still a positive correlation between Y[t] and Y[t-2]: higher values of Y[t-2] correspond to higher values of Y[t]. However, the correlation is weaker than at the first lag.
# Data for closing Prices of Shares
y = Stocks
# Generating third lag of stocks
x = zlag(zlag(zlag(Stocks)))
# Creating index to drop the three leading NA values introduced by triple lagging
index = 4:length(x)
cor(y[index],x[index])
## [1] 0.9088598
The correlation of 0.9088598 suggests that Y[t] and Y[t-3] are positively correlated. This correlation can be seen by plotting the third lag along the x axis and the actual values along the y axis.
plot(y=Stocks,x=zlag(zlag(zlag(Stocks))),ylab='Average Shares', xlab='Third lag of average shares', main = "Scatter plot showing autocorrelation between Y[t] and Y[t-3]")
Not all higher values of Y[t-3] correspond to higher values of Y[t], which explains why the correlation, while still positive, continues to weaken.
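The three lag calculations above repeat the same pattern. As a compact sketch (assuming Stocks and the TSA function zlag() are loaded as above), the correlations for the first few lags can be computed in a single loop:

# Pearson correlation between the series and its k-th lag, for k = 1..3
for (k in 1:3) {
  x = Stocks
  for (i in 1:k) x = zlag(x)   # apply zlag() k times to obtain the k-th lag
  index = (k + 1):length(x)    # drop the k leading NA values
  cat("Lag", k, "correlation:", cor(Stocks[index], x[index]), "\n")
}

Under these assumptions, this reproduces the three correlations reported above.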
The form of the underlying model is still unknown to us. One method of estimating it is to fit a regression model: regression analysis can be applied to model a non-constant mean trend.
Regression Trend Models: 1. Linear 2. Quadratic 3. Cosine
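In standard notation, the mean functions of these trend models take the following forms (with $f$ the frequency of the cosine component):

$$\mu_t = \beta_0 + \beta_1 t \qquad \text{(linear)}$$
$$\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2 \qquad \text{(quadratic)}$$
$$\mu_t = \beta_0 + \beta_1 \cos(2\pi f t) + \beta_2 \sin(2\pi f t) \qquad \text{(cosine)}$$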
First, get the time stamps of the time series
# Placing Linear model
Lmodel_1 = lm(Stocks ~ time(Stocks)) # Naming the linear trend model
summary(Lmodel_1)
##
## Call:
## lm(formula = Stocks ~ time(Stocks))
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.083 -8.949 0.231 8.896 23.370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.252 5.316 24.88 <2e-16 ***
## time(Stocks) -37.983 3.488 -10.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.08 on 142 degrees of freedom
## Multiple R-squared: 0.4551, Adjusted R-squared: 0.4513
## F-statistic: 118.6 on 1 and 142 DF, p-value: < 2.2e-16
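From the coefficient estimates above, the fitted linear trend is:

$$\hat{\mu}_t = 132.252 - 37.983\,t$$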
The lm() function fits the regression model, while the abline() function overlays the fitted trend line on the time series plot.
plot(Stocks,type='o',ylab='Average shares')
# Add the fitted trend line
abline(Lmodel_1)
Time series plot of standardized residuals: ideally this looks like a random white noise sequence, indicating that the fitted trend model has captured the bulk of the structure in the observations.

Histogram of standardized residuals: to support the trend fit, we want a symmetric, roughly normal distribution of standardized residuals.

QQ plot of standardized residuals: we want the points to fall close to the reference line.

Shapiro-Wilk test: this tests whether the standardized residuals are normal. H0: the data are normally distributed; HA: the data are not normally distributed. If p < 0.05, the null hypothesis is rejected and we conclude that the data are not normal; if p > 0.05, we cannot reject the null hypothesis and treat the data as normally distributed. Here the p-value is 0.01413, so normality is rejected.

ACF and PACF of standardized residuals: we expect to see all bars inside the horizontal dashed confidence bounds, which would indicate that the trend model has captured all the structure and nothing remains in the residuals.
# Residual analysis
res.Lmodel_1 = rstudent(Lmodel_1)
win.graph(width=4.7, height=3,pointsize=8)
par(mfrow=c(3,2))
# TS Plot
plot(y = res.Lmodel_1, x = as.vector(time(Stocks)),xlab = 'Trading Days', ylab='Standardized Residuals',type='l',main = "Standardised residuals of linear model.")
# Histogram
hist(res.Lmodel_1,xlab='Standardized Residuals', main = "Standardised residuals histogram")
# QQ Plot
qqnorm(y=res.Lmodel_1, main = "Standardised residuals in a QQ plot")
qqline(y=res.Lmodel_1, col = 2, lwd = 1, lty = 2)
# Shapiro Wilk Test
shapiro.test(res.Lmodel_1)
##
## Shapiro-Wilk normality test
##
## data: res.Lmodel_1
## W = 0.97653, p-value = 0.01413
# ACF Test
acf(res.Lmodel_1, main = "ACF of standardized residuals.")
# PACF Test
pacf(res.Lmodel_1, main = "PACF of standardized residuals.")
par(mfrow=c(1,1))
Applying these criteria to the linear model: the histogram of standardized residuals is not symmetric, so it does not support the claim that the trend model captures the majority of the observations, and the Shapiro-Wilk p-value of 0.01413 (p < 0.05) confirms that the residuals are not normally distributed. For the model to be adequate, we would also need all bars of the ACF and PACF of the standardized residuals to lie within the horizontal dashed confidence bounds, indicating that no structure remains in the residuals.
Here, t2 is calculated by squaring t and is used as an additional regressor in the quadratic model.
# Quadratic model
# Get time points
t = time(Stocks)
# Create t^2
t2 = t^2
# labeling the quadratic trend model as model2
Lmodel_2 = lm(Stocks~ t + t2)
summary(Lmodel_2)
##
## Call:
## lm(formula = Stocks ~ t + t2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.8176 -3.5322 -0.7553 3.6988 12.5656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -175.12 13.81 -12.68 <2e-16 ***
## t 388.67 18.93 20.54 <2e-16 ***
## t2 -142.55 6.30 -22.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.634 on 141 degrees of freedom
## Multiple R-squared: 0.8823, Adjusted R-squared: 0.8807
## F-statistic: 528.6 on 2 and 141 DF, p-value: < 2.2e-16
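From the coefficient estimates above, the fitted quadratic trend is:

$$\hat{\mu}_t = -175.12 + 388.67\,t - 142.55\,t^2$$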
plot(ts(fitted(Lmodel_2)), ylim = c(min(c(fitted(Lmodel_2), as.vector(Stocks))), max(c(fitted(Lmodel_2),as.vector(Stocks)))),ylab='Average shares' , main = "Fitted quadratic curve to Stocks", type="l",lty=2,col="blue")
# Add the fitted trend line
lines(as.vector(Stocks),type="o")
1. The coefficient of t is 388.67, meaning an increase of 388.67 per unit of time (t). 2. The coefficient of t2 is -142.55, meaning a decrease of 142.55 per unit of t2. 3. Pr(>|t|) for both t and t2 is < 0.05, so both terms carry significant information. 4. The adjusted R-squared of the model is 0.8807, which indicates a good fit and a clear improvement over the linear trend model. 5. The overall p-value is well below 0.05, giving strong evidence against H0 and indicating that this trend model captures the information in the data.
Performing residual analysis on the quadratic trend model.
# Residual analysis
res.Lmodel_2= rstudent(Lmodel_2)
win.graph(width=4.7, height=3,pointsize=8)
par(mfrow=c(3,2))
# TS Plot
plot(y = res.Lmodel_2, x = as.vector(time(Stocks)),xlab = 'Time', ylab='Standardized Residuals',type='l',main = "Standardised residuals from quadratic model.")
# Histogram
hist(res.Lmodel_2,xlab='Standardized Residuals', main = "Histogram of standardised residuals.")
# QQ Plot
qqnorm(y=res.Lmodel_2, main = "QQ plot of standardised residuals.")
qqline(y=res.Lmodel_2, col = 2, lwd = 1, lty = 2)
## Shapiro Wilk Test
shapiro.test(res.Lmodel_2)
##
## Shapiro-Wilk normality test
##
## data: res.Lmodel_2
## W = 0.98771, p-value = 0.2324
# ACF Test
acf(res.Lmodel_2, main = "ACF of standardized residuals.")
# PACF Test
pacf(res.Lmodel_2, main = "PACF of standardized residuals.")
par(mfrow=c(1,1))
This model looks much better than the previous linear model. The time series plot of the standardized residuals supports the view that the trend model captures the majority of the observations. The histogram of standardized residuals is close to symmetric, with a slight right skew. Almost all the points in the QQ plot stick to the reference line, which suggests the model fits well. The Shapiro-Wilk test gives p = 0.2324; since p > 0.05, the residuals can be treated as normally distributed. Most of the bars in the ACF of the standardized residuals lie within the horizontal dashed confidence bounds, indicating that most of the structure in the data has been captured.
The harmonics are first calculated with the harmonic(x, m) function, where x denotes the time series and m denotes the number of pairs of harmonic functions to be created. The time series data and the harmonics are combined into a data frame, and the cosine behaviour is then modelled with the regression function lm().
Trying to fit the cosine model for the time series data.
# calculate cos(2*pi*t) and sin(2*pi*t)
har. <- harmonic(Stocks, 1)
data <- data.frame(Stocks,har.)
Lmodel_3 <- lm(Stocks ~ cos.2.pi.t. + sin.2.pi.t. , data = data)
summary(Lmodel_3)
##
## Call:
## lm(formula = Stocks ~ cos.2.pi.t. + sin.2.pi.t., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.116 -5.243 0.342 7.315 19.029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.4093 0.9136 82.538 < 2e-16 ***
## cos.2.pi.t. -14.4388 1.2921 -11.175 < 2e-16 ***
## sin.2.pi.t. 9.1899 1.2921 7.113 5.26e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.96 on 141 degrees of freedom
## Multiple R-squared: 0.5545, Adjusted R-squared: 0.5481
## F-statistic: 87.73 on 2 and 141 DF, p-value: < 2.2e-16
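From the coefficient estimates above, the fitted cosine trend is:

$$\hat{\mu}_t = 75.4093 - 14.4388\,\cos(2\pi t) + 9.1899\,\sin(2\pi t)$$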
plot(ts(fitted(Lmodel_3)), ylab='Average shares', main = "Fitted cosine wave to the Stocks series.",
ylim = c(min(c(fitted(Lmodel_3), as.vector(Stocks))) ,max(c(fitted(Lmodel_3), as.vector(Stocks)))), col = "red" )
lines(as.vector(Stocks),type="o")
Let’s look at the residuals of the cosine trend model.
res.Lmodel_3= rstudent(Lmodel_3)
win.graph(width=4.7, height=3,pointsize=8)
par(mfrow=c(3,2))
# TS Plot
plot(y = res.Lmodel_3, x = as.vector(time(Stocks)),xlab = 'Time', ylab='Standardized Residuals',type='l',main = "Standardised residuals from cosine model.")
# Histogram
hist(res.Lmodel_3,xlab='Standardized Residuals', main = "Histogram of standardised residuals.")
# QQ Plot
qqnorm(y=res.Lmodel_3, main = "QQ plot of standardised residuals.")
qqline(y=res.Lmodel_3, col = 2, lwd = 1, lty = 2)
# Shapiro Wilk test
shapiro.test(res.Lmodel_3)
##
## Shapiro-Wilk normality test
##
## data: res.Lmodel_3
## W = 0.92724, p-value = 1.002e-06
# ACF Plot
acf(res.Lmodel_3, main = "ACF of standardized residuals.")
# PACF Plot
pacf(res.Lmodel_3, main = "PACF of standardized residuals.")
par(mfrow=c(1,1))
The time series plot of the standardized residuals shows a clear remaining pattern, indicating that the model did not capture the majority of the structure in the data; this trend model therefore provides a poor fit. The histogram does not show a symmetric distribution. The majority of the points in the QQ plot drift away from the reference line, again indicating a poor trend fit. The Shapiro-Wilk test yields a p-value of 1.002e-06, so the residuals are not normally distributed. Many lags exceed the confidence bounds in both the ACF and PACF, demonstrating that a significant amount of structure has not been captured and that this model is ineffective.
We may conclude that, of the three deterministic trend models, the quadratic model explains the most variance. It also satisfies the normality checks. As a result, we may conclude that it is the most appropriate fit for our Stocks dataset. We will now use the quadratic model to forecast the next five observations of the series.
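As a quick cross-check of this conclusion, a minimal sketch (assuming the fitted objects Lmodel_1, Lmodel_2, and Lmodel_3 are still in the workspace) compares the three models on adjusted R-squared and AIC:

# Side-by-side comparison of the three deterministic trend models
models = list(linear = Lmodel_1, quadratic = Lmodel_2, cosine = Lmodel_3)
data.frame(adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
           AIC           = sapply(models, AIC))

Under these assumptions, the quadratic model should show both the highest adjusted R-squared (0.8807 in the summaries above) and the lowest AIC.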
# Prediction of the next 5 observations using the quadratic model
t = c(115,116,117,118,119)
t2= t^2
# Assemble the new time points and their squares into a data frame for predict()
predict_data = data.frame(t,t2)
# Forecasting of the data
forecast_data = predict(Lmodel_2, predict_data, interval = "prediction")
# The interval argument attaches lower/upper prediction bounds to forecast_data
print(forecast_data)
## fit lwr upr
## 1 -1840647 -2001108 -1680185
## 2 -1873186 -2036488 -1709885
## 3 -1906011 -2072177 -1739844
## 4 -1939120 -2108176 -1770064
## 5 -1972515 -2144486 -1800544
# Plotting the prediction data
plot(Stock, ylim = c(0,120), ylab = "Average Shares", main = "Forecast of the next 5 observations for ASX Stocks")
lines(ts(as.vector(forecast_data[,1]), start = 115), col="blue", type="l")
lines(ts(as.vector(forecast_data[,2]), start = 115), col="green", type="l")
lines(ts(as.vector(forecast_data[,3]), start = 115), col="green", type="l")
legend("topright", lty=1, pch=1, col=c("black","green","blue"), text.width = 38,
c("Data","Forecast limits(5%)", "Actual Forecasts"))
The data were first transformed to a time series and examined for the five descriptive features, after which the Pearson correlations between the data and its lags were computed. The ACF and PACF plots indicate that the series is seasonal in nature and non-stationary. Assuming the trend is deterministic, we fitted linear, quadratic, and cosine regression models, examined their summaries, and performed residual analysis for each. This analysis makes it clear that the quadratic model provides the best fit and captures the majority of the structure in the data, so the forecast for the next five observations was produced with the quadratic model. Nonetheless, the deterministic trend modelling and residual analysis suggest that the series has only a weak deterministic trend and more plausibly follows a stochastic trend.