Introduction

The aim of this project is to determine the most suitable trend model for an AUD100 dataset representing the returns of a share market trader’s investment portfolio. Various trend models including linear, quadratic, cosine, cyclical, and seasonal trends will be explored, and the best fitting model identified. Additionally, this model will be used to predict the returns for the next 5 trading days.

Data Exploration

The dataset comprises returns, denominated in AUD100, obtained from a share market trader’s investment portfolio. It consists of 179 observations collected over consecutive trading days within a single year, representing the performance of the portfolio during that period.

Descriptive Statistics and Time Series Plot

aud100_data <- read_csv("assignment1Data2024.csv")
## New names:
## Rows: 179 Columns: 2
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (2): ...1, x
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
head(aud100_data)
## # A tibble: 6 × 2
##    ...1     x
##   <dbl> <dbl>
## 1     1  150.
## 2     2  147.
## 3     3  144.
## 4     4  141.
## 5     5  140.
## 6     6  139.
colnames(aud100_data)<-c('day','x')
summary(aud100_data)
##       day              x          
##  Min.   :  1.0   Min.   :-49.167  
##  1st Qu.: 45.5   1st Qu.: -2.685  
##  Median : 90.0   Median : 51.105  
##  Mean   : 90.0   Mean   : 57.043  
##  3rd Qu.:134.5   3rd Qu.:117.781  
##  Max.   :179.0   Max.   :214.611
sd(aud100_data$x)
## [1] 68.3587
aud100.ts = ts(aud100_data$x, start=1,frequency = 1)
plot(aud100.ts, type='o', pch=20 ,ylab="Aud100",xlab="Time",main ="AUD100 Time Series Plot")

The mean return is approximately 57.04, with a maximum of 214.61 and a minimum of -49.17. The standard deviation is approximately AUD 68.36, which indicates the typical size of the fluctuation around the mean return.

  • Trend: There is a decreasing trend over roughly the first 100 trading days, followed by an upward trend for the remaining trading days.

  • Seasonality: The repeating wave-like pattern throughout the series suggests possible seasonality.

  • Behaviour: The possible seasonality makes the behaviour harder to determine. However, successive observations follow each other closely, which suggests autoregressive behaviour.

  • Changing variance: Some change in variance can be observed from around day 125 onwards.

  • Intervention point: No intervention point is observed.

Scatter plot

plot(y=aud100.ts,x=zlag(aud100.ts),
ylab='AUD100', xlab='AUD100 in the previous day',main= "Scatter plot of AUD100 values")

y=aud100.ts
x=zlag(aud100.ts)
index = 2:length(x) # Create an index to get rid of the first NA value in x
cor(y[index],x[index])
## [1] 0.9868369

The scatter plot shows a strong positive linear correlation (r = 0.987) between the AUD100 value on a given day and the value on the previous day.
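
The lag-1 correlation can also be computed without `zlag()`. The following is a minimal sketch in base R, run on a simulated random walk rather than on the AUD100 data; the names `x_sim` and `lag1_cor` are illustrative assumptions and do not appear in the analysis above.

```r
# Simulated stand-in series (assumption: this is NOT the AUD100 data)
set.seed(123)
x_sim <- cumsum(rnorm(179))

# Pair each observation with the one before it by dropping the
# first value from one copy and the last value from the other
n <- length(x_sim)
lag1_cor <- cor(x_sim[2:n], x_sim[1:(n - 1)])
lag1_cor
```

Dropping the first and last values plays the same role as the `index = 2:length(x)` step above, which removes the `NA` that the lag operation creates.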

ACF Plot

The AUD100 series consists of daily values. To identify a frequency for the series, the ACF plot is examined. This frequency is then used to fit the seasonal and cosine (harmonic) trend models.

acf(aud100.ts, main = " ACF for AUD100 series", lag.max = 40 )

A wave-like pattern is observed, which suggests seasonality. Approximately every 5th or 6th lag marks the end of an upward or downward run in the plot. Furthermore, the data were collected on open trading days (Monday to Friday). Hence a frequency of 5 is used for the seasonal and cosine models, and each week is interpreted as 5 days.

aud100.ts = ts(aud100_data$x, start=1,frequency = 5)
plot(aud100.ts,ylab="Aud100",xlab="Time",main ="AUD100 Time Series Plot")
points(aud100.ts,x=time(aud100.ts), pch=as.vector(season(aud100.ts, 1:5)))

The plot above labels each point with its position in the 5-day cycle. Although there is a wave-like pattern, the peaks and troughs do not fall on the same day of the cycle, which suggests that there is no seasonal pattern with a period of 5 in the series.
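
A numerical companion to this visual check is to compare the mean of the series at each position in the 5-day cycle. The sketch below uses base R's `cycle()` on a simulated series, not the AUD100 data; `x_sim`, `pos`, `means_by_pos` and `counts_by_pos` are hypothetical names.

```r
# Simulated stand-in series with the same length and frequency
# as the report's series (assumption: NOT the AUD100 data)
set.seed(42)
x_sim <- ts(cumsum(rnorm(179)), start = 1, frequency = 5)

# cycle() returns the position of each observation within its period (1 to 5)
pos <- as.integer(cycle(x_sim))

# Mean of the series at each position in the cycle
means_by_pos <- tapply(as.numeric(x_sim), pos, mean)

# Number of observations at each position
counts_by_pos <- table(pos)

means_by_pos
counts_by_pos
```

If the five means are close to each other relative to the spread of the series, there is little evidence of a day-of-week effect. These means are the same quantities that a seasonal means model without an intercept estimates as its coefficients.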

Data Modelling and Residual Check Analysis

Linear Trend Model

aud100.ts = ts(aud100_data$x, start=1,frequency = 1)
t <- time(aud100.ts)
t
## Time Series:
## Start = 1 
## End = 179 
## Frequency = 1 
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
model1 <- lm(aud100.ts~t)
summary(model1)
## 
## Call:
## lm(formula = aud100.ts ~ t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.186  -58.689   -5.424   57.532  172.072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 72.22086   10.20622   7.076 3.33e-11 ***
## t           -0.16865    0.09835  -1.715   0.0881 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.99 on 177 degrees of freedom
## Multiple R-squared:  0.01634,    Adjusted R-squared:  0.01079 
## F-statistic: 2.941 on 1 and 177 DF,  p-value: 0.08812
plot(aud100.ts, ylab='AUD100', xlab='Time', type='o', pch=20,
     main='Time series plot AUD100 series')
abline(model1, col="red")

The plot shows that the linear trend model does not capture the general trend in the series well.

From the linear trend model summary, the p-value is 0.08812 > 0.05, so the model is not significant at the 5% level. Likewise, the R-squared value of 0.016 (adjusted 0.011) shows that the model explains less than 2% of the variation in the series.

Residual Analysis of linear model

y=rstudent(model1)
par(mfrow = c(2, 2))
plot(y,x=as.vector(time(aud100.ts)),xlab='Time', ylab='Standardized residuals', type='o', main ='Time Series of standardized residuals of linear model',cex.main = 0.8 )
hist(y, xlab= 'Standardized Residuals', main= 'Histogram of standardized Residuals of linear model', cex.main = 0.8)
qqnorm(y,main='QQplot of standardized residuals of linear model',cex.main = 0.8)
qqline(y, col = 2, lwd = 1, lty = 2)
acf(y, main = "ACF of standardized residuals of linear model.",cex.main = 0.8)

shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.9525, p-value = 1.022e-05
  • The time series plot of the residuals does not show a random pattern. A trend can be observed, which suggests that the model is not adequate.

  • The histogram is slightly left skewed, with the majority of the values between -1.5 and -0.5.

  • In the QQ plot, most of the points at both ends deviate from the straight line, which indicates a departure from normality in the distribution of the residuals.

  • The ACF plot shows significant correlations at all lags, which suggests there is still information left in the residuals that the model has not accounted for.

  • The Shapiro-Wilk normality test gives a p-value of 1.022e-05 < 0.05, therefore we reject H0: the residuals are normally distributed.
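
The same five residual checks are repeated for every model in this report, so they could be wrapped in a single function. The following is a sketch only: `residual_checks()` is a hypothetical helper that does not appear in the original analysis, and it is demonstrated on simulated data rather than on the AUD100 models.

```r
# Hypothetical helper (assumption: not part of the original analysis).
# Draws the four diagnostic plots used in this report and returns
# the Shapiro-Wilk test result for the studentized residuals.
residual_checks <- function(model, label) {
  res <- rstudent(model)
  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))  # restore the plotting layout afterwards

  plot(res, type = "o", xlab = "Time", ylab = "Standardized residuals",
       main = paste("Residuals over time:", label), cex.main = 0.8)
  hist(res, xlab = "Standardized residuals",
       main = paste("Histogram of residuals:", label), cex.main = 0.8)
  qqnorm(res, main = paste("QQ plot of residuals:", label), cex.main = 0.8)
  qqline(res, col = 2, lwd = 1, lty = 2)
  acf(res, main = paste("ACF of residuals:", label), cex.main = 0.8)

  shapiro.test(res)
}

# Usage on a simulated linear trend (assumption: illustrative data only)
set.seed(1)
t_sim <- 1:100
y_sim <- 5 + 0.3 * t_sim + rnorm(100)
fit_sim <- lm(y_sim ~ t_sim)
sw <- residual_checks(fit_sim, "simulated linear model")
sw$p.value
```

Passing the model label as an argument keeps the plot titles in step with the model being checked, which avoids copy-and-paste slips in the titles.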

Quadratic Trend Model

t <- time(aud100.ts)
t2 = t^2

model2 <- lm(aud100.ts~t+t2)
summary(model2)
## 
## Call:
## lm(formula = aud100.ts ~ t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.258 -19.321   0.571  19.992  53.165 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.141e+02  5.958e+00   35.93   <2e-16 ***
## t           -4.871e+00  1.528e-01  -31.87   <2e-16 ***
## t2           2.612e-02  8.224e-04   31.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.27 on 176 degrees of freedom
## Multiple R-squared:  0.8539, Adjusted R-squared:  0.8523 
## F-statistic: 514.4 on 2 and 176 DF,  p-value: < 2.2e-16
plot(ts(fitted(model2)), ylab='AUD100', xlab='Time',
     main='Fitted Quadratic curve to AUD100 series',ylim = c(min(c(fitted(model2), as.vector(aud100.ts))) ,
max(c(fitted(model2), as.vector(aud100.ts)))
),col="red" )
lines(as.vector(aud100.ts),type="o", pch=20)

The plot shows that the quadratic trend captures the general trend better than the linear model.

From the quadratic trend model summary, the overall p-value is < 2.2e-16, so the model is significant at the 5% level. The coefficients of t and t2 are also significant at the 5% level. Finally, the R-squared value of 0.854 shows that the model explains about 85% of the variation in the series.
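
As a quick check on the fitted curve, the turning point of the quadratic can be worked out from the coefficients printed in the summary above. This is a small sketch; the coefficient values are copied (rounded) from that output, and the variable names `b0`, `b1`, `b2`, `t_min` and `fit_min` are illustrative assumptions.

```r
# Coefficients copied from summary(model2) above (rounded as printed)
b0 <- 214.1     # intercept
b1 <- -4.871    # coefficient of t
b2 <- 0.02612   # coefficient of t^2

# A quadratic b0 + b1*t + b2*t^2 with b2 > 0 reaches its minimum
# where the derivative b1 + 2*b2*t equals zero
t_min <- -b1 / (2 * b2)

# Fitted value at the turning point
fit_min <- b0 + b1 * t_min + b2 * t_min^2

round(c(turning_point = t_min, fitted_minimum = fit_min), 1)
```

The turning point falls at roughly day 93, which is consistent with the earlier observation that the series declines over about the first 100 trading days and rises afterwards.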

Residual Analysis of Quadratic Model

y=rstudent(model2)
par(mfrow = c(2, 2))
plot(y,x=as.vector(time(aud100.ts)),xlab='Time', ylab='Standardized residuals', type='o', main ='Time Series of standardized residuals of quadratic model',cex.main = 0.8 )
hist(y, xlab= 'Standardized Residuals', main= 'Histogram of standardized Residuals of quadratic model', cex.main = 0.8)
qqnorm(y,main='QQplot of standardized residuals of quadratic model',cex.main = 0.8)
qqline(y, col = 2, lwd = 1, lty = 2)
acf(y, main = "ACF of standardized residuals of quadratic model.",cex.main = 0.8)

shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.98397, p-value = 0.03799
  • The time series plot of the residuals still shows some remaining pattern. However, the residuals stay much closer to zero than those of the linear trend model, which indicates that this is a better model for the series.

  • The histogram is approximately symmetric, with the majority of the values between -1 and 1.

  • In the QQ plot, the points follow the straight line closely, which indicates approximate normality in the distribution of the residuals.

  • The ACF plot shows significant correlations at most lags, which suggests there is still information left in the residuals that the model has not accounted for.

  • The Shapiro-Wilk normality test gives a p-value of 0.03799 < 0.05, therefore we reject H0: the residuals are normally distributed at the 5% level, although the evidence against normality is much weaker than for the linear model.
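
The statement that most lags are significant can be made concrete by counting how many sample autocorrelations fall outside the approximate 95% bounds that `acf()` draws, which are plus or minus `qnorm(0.975) / sqrt(n)`. The sketch below uses a simulated autocorrelated series as a stand-in for the residuals; `res_sim`, `bound` and `lags_outside` are hypothetical names.

```r
# Simulated autocorrelated series standing in for model residuals
# (assumption: NOT the residuals of the models in this report)
set.seed(7)
n <- 179
res_sim <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))

# Sample autocorrelations without drawing the plot
acf_out <- acf(res_sim, lag.max = 20, plot = FALSE)

# Approximate 95% bounds under the white noise assumption
bound <- qnorm(0.975) / sqrt(n)

# Drop lag 0 (always 1) and list the lags outside the bounds
lags_outside <- which(abs(acf_out$acf[-1]) > bound)

bound
lags_outside
```

With 179 observations the bound is about 0.147, so any sample autocorrelation larger than that in absolute value would be flagged as significant on the ACF plots in this report.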

Cyclical/Seasonal Trend Model

A frequency of 5 is used to fit the seasonal model.

Seasonal model without intercept

aud100.ts = ts(aud100_data$x, start=1,frequency = 5)
day. <- season(aud100.ts,1:5)
day.
##   [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2
##  [38] 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4
##  [75] 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1
## [112] 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3
## [149] 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4
## Levels: 1 2 3 4 5
model3 <- lm(aud100.ts ~ day. -1)
summary(model3)
## 
## Call:
## lm(formula = aud100.ts ~ day. - 1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -106.785  -60.423   -5.695   62.046  156.912 
## 
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)    
## day.1    57.70      11.52   5.008 1.34e-06 ***
## day.2    57.62      11.52   5.001 1.38e-06 ***
## day.3    57.83      11.52   5.020 1.27e-06 ***
## day.4    57.78      11.52   5.015 1.30e-06 ***
## day.5    54.21      11.68   4.639 6.85e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.13 on 174 degrees of freedom
## Multiple R-squared:  0.4121, Adjusted R-squared:  0.3952 
## F-statistic: 24.39 on 5 and 174 DF,  p-value: < 2.2e-16
plot(ts(fitted(model3)),type='o',xlab='Time', ylab='AUD100',ylim = c(min(c(fitted(model3), as.vector(aud100.ts))), 
              max(c(fitted(model3), as.vector(aud100.ts)))),
     main = "Fitted seasonal model to AUD series.", lty=2, col="red")
lines(as.vector(aud100.ts),type="o")

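The plot shows that the seasonal trend model without intercept is unable to capture the general trend.

From the summary of the seasonal model without intercept, the overall p-value is < 2.2e-16, and the coefficients day.1 to day.5 are each significant at the 5% level. However, without an intercept these tests only compare the daily means with zero, and R-squared is computed relative to zero rather than to the series mean, so the reported values (R-squared 0.412, adjusted 0.395) are not comparable with those of the other models. The five daily means are nearly identical (54.21 to 57.83), which indicates that there is no day-of-week effect.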

Seasonal trend model with intercept

model3.1 <- lm(aud100.ts ~ day.)
summary(model3.1)
## 
## Call:
## lm(formula = aud100.ts ~ day.)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -106.785  -60.423   -5.695   62.046  156.912 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.69900   11.52090   5.008 1.34e-06 ***
## day.2       -0.08101   16.29301  -0.005    0.996    
## day.3        0.13368   16.29301   0.008    0.993    
## day.4        0.07842   16.29301   0.005    0.996    
## day.5       -3.49221   16.40898  -0.213    0.832    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.13 on 174 degrees of freedom
## Multiple R-squared:  0.0004218,  Adjusted R-squared:  -0.02256 
## F-statistic: 0.01835 on 4 and 174 DF,  p-value: 0.9993
plot(ts(fitted(model3.1)),type='o',xlab='Time', ylab='AUD100',ylim = c(min(c(fitted(model3.1), as.vector(aud100.ts))), 
              max(c(fitted(model3.1), as.vector(aud100.ts)))),
     main = "Fitted seasonal model with intercept to AUD series.", lty=2, col="red")
lines(as.vector(aud100.ts),type="o")

The plot shows that the fitted values of the seasonal model with intercept match those of the model without intercept, so it is likewise unable to capture the general trend.

From the summary of the seasonal model with intercept, the p-value is 0.9993 > 0.05, so the model is not significant at the 5% level. The R-squared value is 0.0004 and the adjusted R-squared is negative (-0.023), so the model explains essentially none of the variation in the series.
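
The two seasonal models are the same fit written in two ways, which is why their residual summaries are identical while their R-squared values differ so much. The sketch below illustrates this on simulated data, not the AUD100 series; all names in it (`day_sim`, `y_sim`, `fit_no_int`, `fit_int` and so on) are hypothetical.

```r
# Simulated data with a 5-level day factor and no day effect
# (assumption: NOT the AUD100 data)
set.seed(10)
day_sim <- factor(rep(1:5, length.out = 179))
y_sim <- 50 + rnorm(179, sd = 10)

fit_no_int <- lm(y_sim ~ day_sim - 1)  # one mean per day
fit_int    <- lm(y_sim ~ day_sim)      # day 1 mean plus differences

# Both parameterisations give the same fitted values
same_fit <- isTRUE(all.equal(unname(fitted(fit_no_int)),
                             unname(fitted(fit_int))))

# R-squared as reported by summary()
r2_no_int <- summary(fit_no_int)$r.squared
r2_int    <- summary(fit_int)$r.squared

# Reproduce both by hand from the shared residual sum of squares
rss <- sum(resid(fit_int)^2)
r2_uncentered <- 1 - rss / sum(y_sim^2)                  # relative to zero
r2_centered   <- 1 - rss / sum((y_sim - mean(y_sim))^2)  # relative to the mean

c(same_fit = same_fit, r2_no_int = r2_no_int, r2_int = r2_int)
```

Because the no-intercept R-squared measures improvement over predicting zero, it is large whenever the series mean is far from zero, even when the day factor explains nothing.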

Residual Analysis of Seasonal Trend

y=rstudent(model3)
par(mfrow = c(2, 2))
plot(y,x=as.vector(time(aud100.ts)),xlab='Time', ylab='Standardized residuals', type='o', main ='Time Series of standardized residuals of seasonal model',cex.main = 0.8 )

points(y,x=as.vector(time(aud100.ts)))
       
hist(y, xlab= 'Standardized Residuals', main= 'Histogram of standardized Residuals of seasonal model', cex.main = 0.8)
qqnorm(y,main='QQplot of standardized residuals of seasonal model',cex.main = 0.8)
qqline(y, col = 2, lwd = 1, lty = 2)
acf(y, main = "ACF of standardized residuals of seasonal model.",cex.main = 0.8)

shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.94757, p-value = 3.631e-06
  • The time series plot of the residuals does not show a random pattern. A trend can be observed, which suggests that the model is not adequate for forecasting.

  • The histogram is slightly left skewed, with the majority of the values between -1.5 and -0.5.

  • In the QQ plot, most of the points at both ends deviate from the straight line, which indicates a departure from normality in the distribution of the residuals.

  • The ACF plot shows significant correlations at all lags, which suggests there is still information left in the residuals that the model has not accounted for.

  • The Shapiro-Wilk normality test gives a p-value of 3.631e-06 < 0.05, therefore we reject H0: the residuals are normally distributed.

Cosine Trend Model

har. <- harmonic(aud100.ts, 1) # calculate cos(2*pi*t) and sin(2*pi*t)
model4 <- lm(aud100.ts ~ har.)

summary(model4)
## 
## Call:
## lm(formula = aud100.ts ~ har.)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.278  -59.716   -6.408   61.674  158.081 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      57.0348     5.1379  11.101   <2e-16 ***
## har.cos(2*pi*t)  -0.5054     7.2496  -0.070    0.945    
## har.sin(2*pi*t)   1.2955     7.2826   0.178    0.859    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.74 on 176 degrees of freedom
## Multiple R-squared:  0.0002069,  Adjusted R-squared:  -0.01115 
## F-statistic: 0.01821 on 2 and 176 DF,  p-value: 0.982
plot(ts(fitted(model4)),type='o',xlab='Time', ylab='AUD100',ylim = c(min(c(fitted(model4), as.vector(aud100.ts))), 
              max(c(fitted(model4), as.vector(aud100.ts)))),
     main = "Fitted cosine model to AUD series.", lty=2, col="red")
lines(as.vector(aud100.ts),type="o")

The plot shows that the cosine trend model is unable to capture the general trend.

From the cosine trend model summary, the p-value is 0.982 > 0.05, so the model is not significant at the 5% level. The R-squared value of 0.0002 shows that the model explains less than 1% of the variation in the series.
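
The regressors behind the cosine model can be built by hand, following the comment in the code above that `harmonic()` calculates cos(2*pi*t) and sin(2*pi*t). The sketch below does this in base R on simulated noise, not on the AUD100 data; `x_sim`, `t_frac`, `cos_t`, `sin_t` and `fit_cos` are hypothetical names.

```r
# Simulated stand-in series with frequency 5
# (assumption: NOT the AUD100 data)
set.seed(5)
x_sim <- ts(rnorm(179), start = 1, frequency = 5)

# With frequency 5, time() advances by 1/5 per observation,
# so cos(2*pi*t) and sin(2*pi*t) complete one cycle every 5 observations
t_frac <- as.numeric(time(x_sim))
cos_t <- cos(2 * pi * t_frac)
sin_t <- sin(2 * pi * t_frac)

# Cosine trend model: intercept plus one cosine and one sine term
fit_cos <- lm(as.numeric(x_sim) ~ cos_t + sin_t)
coef(fit_cos)

# The regressors repeat exactly every 5 observations
round(cos_t[1:10], 3)
```

Because the regressors repeat every 5 observations, this model can only describe a pattern with a period of one trading week. It cannot follow the slow decline and recovery seen in the series, which is consistent with the poor fit reported above.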

Residual Analysis of Cosine Model

y=rstudent(model4)
par(mfrow = c(2, 2))

plot(y,x=as.vector(time(aud100.ts)),xlab='Time', ylab='Standardized residuals', type='o', main ='Time Series of standardized residuals of cosine model',cex.main = 0.8 )

points(y,x=as.vector(time(aud100.ts)))
       
hist(y, xlab= 'Standardized Residuals', main= 'Histogram of standardized Residuals of cosine model', cex.main = 0.8)
qqnorm(y,main='QQplot of standardized residuals of cosine model',cex.main = 0.8)
qqline(y, col = 2, lwd = 1, lty = 2)
acf(y, main = "ACF of standardized residuals of cosine model.",cex.main = 0.8)

shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.94756, p-value = 3.628e-06
  • The time series plot of the residuals does not show a random pattern. A trend can be observed, which suggests that the model is not adequate for forecasting.

  • The histogram is slightly left skewed, with the majority of the values between -1.5 and -0.5.

  • In the QQ plot, most of the points at both ends deviate from the straight line, which indicates a departure from normality in the distribution of the residuals.

  • The ACF plot shows significant correlations at all lags, which suggests there is still information left in the residuals that the model has not accounted for.

  • The Shapiro-Wilk normality test gives a p-value of 3.628e-06 < 0.05, therefore we reject H0: the residuals are normally distributed.

Forecasting

The quadratic model is the most appropriate of the fitted models for forecasting, based on the model summaries and the residual analysis. It captured the trend in the series better than all the other models. Likewise, its adjusted R-squared of 0.852 means that it explains about 85% of the variation in the series while using only three coefficients, so overfitting is unlikely.

Furthermore, the histogram and QQ plot of its standardized residuals are close to normal, although the Shapiro-Wilk test rejects normality at the 5% level (p = 0.038), and its residual time series plot shows the least trend of all the models.

aud100.ts = ts(aud100_data$x, start=1,frequency = 1)
h <- 5
t <-time(aud100.ts)
t <- seq((length(t)+1), (length(t)+h), 1)
t2 <- t^2

aheadTimes <- data.frame(t,t2)
frcModel2 <- predict(model2, newdata = aheadTimes, interval = "prediction")

plot(aud100.ts, xlim= c(1,190), ylim = c(-60,300),
ylab = "aud100 series",
main = "Forecasts from the quadratic model fitted to
the aud 100 series.")
lines(ts(as.vector(frcModel2[,3]), start = 180), col="blue", type="l")
lines(ts(as.vector(frcModel2[,1]), start = 180), col="red", type="l")
lines(ts(as.vector(frcModel2[,2]), start = 180), col="blue", type="l")
legend("topleft", lty=1, bty = "n",pch=20, col=c("black","blue","red"),
text.width = 18,
c("Data","95% forecast limits", "Forecasts"))

The plot shows an upward trend over the next 5 trading days based on the quadratic trend model, with the blue lines marking the 95% prediction limits.
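
An alternative way to set up the same kind of forecast is to write the squared term inside the model formula with `I(t^2)`, so that the new data frame only needs the future time points. This is a sketch under stated assumptions: it uses simulated data with a roughly similar shape, not the AUD100 series, and `sim`, `fit_quad`, `new_times` and `frc` are hypothetical names.

```r
# Simulated series with a quadratic trend
# (assumption: NOT the AUD100 data)
set.seed(99)
n <- 179
sim <- data.frame(t = 1:n)
sim$y <- 200 - 5 * sim$t + 0.026 * sim$t^2 + rnorm(n, sd = 25)

# I(t^2) computes the squared term inside the formula,
# so no separate t2 variable has to be kept in step with t
fit_quad <- lm(y ~ t + I(t^2), data = sim)

# Forecast the next h time points with 95% prediction limits
h <- 5
new_times <- data.frame(t = (n + 1):(n + h))
frc <- predict(fit_quad, newdata = new_times,
               interval = "prediction", level = 0.95)
frc
```

The result has one row per forecast step, with columns `fit`, `lwr` and `upr`. The limits assume independent, normally distributed errors, so they are likely to be too narrow when the residuals are autocorrelated, as they are for the quadratic model in this report.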

Conclusion

In total, four models were analysed. The adjusted R-squared value and the residual analysis were the main criteria used to identify the most appropriate model for forecasting the AUD100 returns for the next 5 trading days. The quadratic trend model was the most appropriate, with the highest adjusted R-squared value (about 85%) and the best ability to capture the trend. However, its residuals still showed some remaining pattern, and most lags of the residual ACF were significant, which indicates that there is still information left in the residuals that the model has not accounted for. Therefore, while the quadratic trend model may provide reasonable predictions, the unexplained structure in the residuals limits the reliability of the forecast.
