This is a stock price analysis report for Toronto-Dominion Bank (TD) from 2000 to 2020. The objective of the report is to provide suggestions based on two risk measures, Value at Risk and Expected Shortfall. The report consists of three parts: 1) univariate data analysis, 2) multivariate data analysis, and 3) time series data analysis.
The data set was downloaded from Yahoo Finance. We loaded the data and plotted the time series to check the stationarity of the data, and found an increasing trend, which indicated non-stationarity. In order to better analyze the stock data, we calculated the return rate (the current day's adjusted close divided by the previous day's adjusted close) and applied a log transformation. In the time series of the log returns we can see two large fluctuations, corresponding to the two specific time frames that we will discuss in detail later on.
TD <- read.csv(file = 'TD.csv')
plot.ts(TD$Adj.Close);
n=length(TD$Adj.Close);
X=log(TD$Adj.Close[2:n]/TD$Adj.Close[1:(n-1)]);   # daily log returns
plot.ts(X);
As with most financial data, the normal QQ plot looks more like a curve than a straight line, indicating that the log returns do not meet the normality assumption.
n=length(X);
library(evir);
qqnorm(X);
We drew a mean excess plot of the data and compared it with mean excess plots simulated from the exponential, Weibull, and t distributions. After the comparison, we found that our data fit a t-distribution with 3 degrees of freedom. For further confirmation, we plotted the sample quantiles (data) against the true quantiles (of a t distribution with df = 3) and saw a good fit, since the points line up along a straight line.
pvector=seq(1,n,by=1)/(n+1);
true.quantiles=qt(pvector,3); # qt stands for quantiles of t distribution
sample.quantiles=sort(X);
plot(true.quantiles,sample.quantiles);
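The mean excess comparison described above can be sketched as follows (a minimal sketch; the seed, the reference sample sizes, and the Weibull shape parameter are our assumptions):
# Sketch: compare the mean excess plot of the returns with reference samples
# simulated from the exponential, Weibull, and t(3) distributions
library(evir)
set.seed(1)                                   # assumed, for reproducibility
par(mfrow=c(2,2))
meplot(X, main="TD log returns")
meplot(rexp(n), main="Exponential")
meplot(rweibull(n, shape=1.5), main="Weibull (shape assumed)")
meplot(rt(n, df=3), main="t, df = 3")
par(mfrow=c(1,1))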
In the next step, we estimated the mean and variance of the data and obtained two sets of results: the first uses the degrees of freedom fitted by R, the other uses the degrees of freedom estimated by us. We then calculated the Value at Risk under three settings: 1) using the df fitted by R, 2) using the df estimated by us, and 3) a non-parametric method. Note that the t-distribution has Value at Risk of the form \(VaR_X(p) = m + s\,T^{-1}(1-p)\), where \(T^{-1}\) is the quantile function of the t distribution, and we used \(p=0.05\) and \(p=0.01\) for the calculation. Appendix (1) provides the sample code for model 1 and model 3 (non-parametric).
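For reference, the model 1 (parametric t) and model 3 (non-parametric) calculations might look like the sketch below; the actual code is in Appendix (1), and the object names (p1, p2, VaR.1.005, and so on) are our assumptions.
# Sketch: VaR = m + s*T^{-1}(1-p) for model 1; empirical quantile for model 3
library(MASS)
fit1 <- fitdistr(X, 't')                     # model 1: location m, scale s, df fitted by R
m1 <- fit1$estimate['m']; s1 <- fit1$estimate['s']; df1 <- fit1$estimate['df']
p1 <- 0.05; p2 <- 0.01
VaR.1.005 <- m1 + s1*qt(1-p1, df1)           # model 1, p = 0.05
VaR.1.001 <- m1 + s1*qt(1-p2, df1)           # model 1, p = 0.01
VaR.3.005 <- quantile(X, 1-p1)               # model 3, p = 0.05 (empirical quantile)
VaR.3.001 <- quantile(X, 1-p2)               # model 3, p = 0.01 (empirical quantile)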
VaR <- matrix(c(0.02326914, 0.04524155,0.04727082,0.04727082, 0.02296376,0.07998851),
ncol=2,byrow=TRUE)
colnames(VaR) <- c("VaR(p=0.05)","VaR(p=0.01)")
rownames(VaR) <- c("model 1","model 2","model 3")
VaR <- as.table(VaR)
VaR
## VaR(p=0.05) VaR(p=0.01)
## model 1 0.02326914 0.04524155
## model 2 0.04727082 0.04727082
## model 3 0.02296376 0.07998851
From these results, we reason as follows. Since Value at Risk estimates the amount of assets needed to cover possible losses, and those assets are frozen, we would prefer the smallest Value at Risk among the three methods. At \(p=0.05\) the non-parametric approach gave the lowest VaR; however, since the non-parametric approach does not rely on any distribution, it can be applied in any situation, even when a parametric model is valid, whereas a valid parametric model makes fuller use of the distributional information. Thus we consider the parametric approaches first, and model 1 is the most appropriate model so far.
Next, we backtested the VaR to check the accuracy and effectiveness of the estimates. We provide two confidence intervals for each model: 1) at the 5% significance level and 2) at the 1% significance level.
We calculated the violation rates as follows:
Violation <- matrix(c( Violations.rate.1.005, Violations.rate.1.001,
Violations.rate.2.005, Violations.rate.2.001,
Violations.rate.3.005, Violations.rate.3.001),ncol=2,byrow=TRUE)
colnames(Violation) <- c("p=0.05","p=0.01")
rownames(Violation) <- c("model 1","model 2","model 3")
as.table(Violation)
## p=0.05 p=0.01
## model 1 0.0485282418 0.0087509944
## model 2 0.0079554495 0.0079554495
## model 3 0.0499204455 0.0009944312
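The individual violation rates are simply the fractions of observed returns exceeding the corresponding VaR estimate; a minimal sketch, reusing the VaR objects from the sketch in the previous step (model 2 is analogous, using the df estimated by us):
# Sketch: violation rate = proportion of returns exceeding the VaR estimate
Violations.rate.1.005 <- mean(X > VaR.1.005)   # model 1, p = 0.05
Violations.rate.1.001 <- mean(X > VaR.1.001)   # model 1, p = 0.01
Violations.rate.3.005 <- mean(X > VaR.3.005)   # model 3, p = 0.05
Violations.rate.3.001 <- mean(X > VaR.3.001)   # model 3, p = 0.01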
Table 2 displays the confidence intervals for models 1, 2, and 3 at the 5% significance level.
Backtesting <- matrix(c( 0.04258019, 0.054456990, 0.005499005,
0.010408730, 0.04389254, 0.05592849),ncol=2,byrow=TRUE)
colnames(Backtesting) <- c("Lower Bound","Upper Bound")
rownames(Backtesting) <- c("model 1","model 2","model 3")
Backtesting <- as.table(Backtesting)
Backtesting
## Lower Bound Upper Bound
## model 1 4.258019e-02 5.445699e-02
## model 2 5.499005e-03 1.040873e-02
## model 3 4.389254e-02 5.592849e-02
Table 3 displays the confidence intervals at the 1% significance level.
Backtesting2 <- matrix(c( 0.005361492, 0.012137017,0.004722467,
0.011185268, -0.0001522397,0.0021407066),ncol=2,byrow=TRUE)
colnames(Backtesting2) <- c("Lower Bound","Upper Bound")
rownames(Backtesting2) <- c("model 1","model 2","model 3")
Backtesting2 <- as.table(Backtesting2)
Backtesting2
## Lower Bound Upper Bound
## model 1 0.0053614920 0.0121370170
## model 2 0.0047224670 0.0111852680
## model 3 -0.0001522397 0.0021407066
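The intervals in Tables 2 and 3 are consistent with the usual normal-approximation binomial confidence interval around the observed violation rate, \(\hat{v} \pm z_{\alpha/2}\sqrt{\hat{v}(1-\hat{v})/n}\); a minimal sketch for model 1 at the 5% significance level (reusing n and the violation rate from above):
# Sketch: normal-approximation CI around the observed violation rate (model 1, 5% level)
v  <- Violations.rate.1.005
se <- sqrt(v*(1-v)/n)
c(Lower = v - qnorm(0.975)*se, Upper = v + qnorm(0.975)*se)
# At the 1% significance level, replace qnorm(0.975) with qnorm(0.995).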
The graph below displays the log-return series together with the VaR estimates of the three models.
plot.ts(X);                                # log returns with the VaR levels overlaid
lines(VAR.1.005.est,type='l',col='blue')   # model 1, p = 0.05
lines(VAR.1.001.est,type='l',col='red')    # model 1, p = 0.01
lines(VAR.2.005.est,type='l',col='orange') # model 2, p = 0.05
lines(VAR.2.001.est,type='l',col='purple') # model 2, p = 0.01
lines(VAR.3.005.est,type='l',col='green')  # model 3, p = 0.05
lines(VAR.3.001.est,type='l',col='yellow') # model 3, p = 0.01
All values lie within the confidence intervals, meaning that all three models pass the Value at Risk backtesting. Hence, model 1 is still the most appropriate model.
Then we estimated the Expected Shortfall for model 1:
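The code below implements the closed-form Expected Shortfall of a location-scale t distribution, here with \(\nu = 3\) degrees of freedom (so the factor \(\nu - 1\) appears as the 2 in the code):
\[
ES_p = m + s\,\frac{\nu + \left(T^{-1}(1-p)\right)^2}{\nu - 1}\cdot\frac{f_\nu\!\left(T^{-1}(1-p)\right)}{p},
\]
where \(f_\nu\) is the density of the t distribution with \(\nu\) degrees of freedom and \(T^{-1}\) is its quantile function, as before.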
ES.1.005=m1+s1*(3+(qt(1-p1,3))^2)/2*dt(qt(1-p1,3),3)/p1
ES.1.001=m1+s1*(3+(qt(1-p2,3))^2)/2*dt(qt(1-p2,3),3)/p2
ES.1.005;ES.1.001
## [1] 0.03697392
## [1] 0.1449315
Considering TD as one component of the broader financial industry, in the multivariate analysis we would like to compare this single stock with its macro background, so we chose the S&P 500 (Standard and Poor's) index for comparison.
SP<- read.csv(file = 'GSPC.csv')
n1=length(SP$Adj.Close)
Y=log(SP$Adj.Close[2:n1]/SP$Adj.Close[1:(n1-1)])   # S&P 500 daily log returns
We compared the time series and the normal QQ plots of TD and the S&P 500.
n1=length(Y);
par(mfrow=c(2,2))
library(evir);library(MASS);
plot.ts(X);plot.ts(Y);
qqnorm(X);qqnorm(Y)
The two shapes look broadly similar: neither meets normality, and the TD data show a wider range (about -0.15 to 0.15), suggesting larger fluctuations. Therefore, we concluded that the S&P 500 data also follow a t-distribution with 3 degrees of freedom.
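To support this conclusion, the quantile comparison from Part 1 can be repeated for the S&P 500 returns (a minimal sketch mirroring the earlier code; the object name pvector1 is ours):
# Sketch: compare S&P 500 sample quantiles with t(3) quantiles, as done for TD
pvector1 <- seq(1, n1, by=1)/(n1 + 1)
plot(qt(pvector1, 3), sort(Y), xlab="true quantiles (t, df = 3)", ylab="sample quantiles")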
We applied the MASS package again to estimate the t-distribution parameters for TD and the S&P 500, and obtained:
fitdistr(X,'t')
## m s df
## 0.0007116962 0.0093570729 2.8345144204
## (0.0001626059) (0.0001737758) (0.1320291483)
fitdistr(Y,'t')
## m s df
## 0.0005857715 0.0069029154 2.5314693922
## (0.0001209456) (0.0001342558) (0.1131139323)
Note that copula theory shows that two marginals having the same distribution does not guarantee that their bivariate distribution belongs to the same family. However, in our case, since the marginal degrees of freedom of the two data sets are quite similar (2.83 and 2.53), and the bivariate t-distribution forces all marginals to share the same degrees of freedom, we can fit the two series jointly with a bivariate t. Using 'mvtnorm' and 'QRM' we obtained the following.
We obtained the estimated location (mean) vector
library(mvtnorm);library(QRM)
Z=cbind(X,Y)
fit<-fit.mst(Z); #fit biv t distr
mu.fit<-fit$mu; #estimated location vector
Sigma.fit<-as.matrix(fit$Sigma); #estimated scale matrix
df.fit <-fit$df; #estimated df
mu.fit
## X Y
## 0.0009214342 0.0005083723
the estimated scale matrix
Sigma.fit
## X Y
## X 8.664308e-05 -3.178662e-06
## Y -3.178662e-06 5.039172e-05
and the estimated degrees of freedom
df.fit
## [1] 2.694494
The fitted degrees of freedom are close to the marginal estimates, which suggests the multivariate t is a good fit. To demonstrate that the bivariate t is the best option, we conducted a goodness-of-fit comparison between the multivariate normal and the multivariate t. The 3D density plot of our data and that of the simulated multivariate t are quite similar, as are the contour plots. Therefore, the multivariate t is the best option.
n=nrow(Z);                                 # number of observations
Znorm <- rmvnorm(n,mean=mu.fit,sigma=Sigma.fit*df.fit/(df.fit-2),method='chol');  # normal with the covariance implied by the fitted t
Zt <- rmvt(n,sigma=Sigma.fit,df=df.fit);   # bivariate t with the fitted scale matrix and df
Bivnorm.kde <- kde2d(Znorm[,1], Znorm[,2], n = 50);
Bivt.kde <- kde2d(Zt[,1], Zt[,2], n = 50);
Biv.kde <- kde2d(X, Y, n = 50);
par(mfrow=c(1,3));
persp(Biv.kde,phi=45, theta=30, shade=.1, border = NA);
persp(Bivnorm.kde, phi=45, theta=30, shade=.1, border = NA);
persp(Bivt.kde, phi = 45, theta = 30, shade = .1, border = NA);
par(mfrow=c(2,3));
contour(Biv.kde); contour(Bivnorm.kde); contour(Bivt.kde)
Then we calculated the Value at Risk for the bivariate case. Since the joint distribution is bivariate t, we can represent our data \((X_1, X_2)\) as \((\mu_1, \mu_2) + \sqrt{W}\times(Z_1,Z_2)\), where \((Z_1, Z_2)\) is bivariate normal whose covariance matrix is the scale matrix of the bivariate t, and \(W\) is the positive mixing variable determined by the degrees of freedom.
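A short derivation of the formula used in the code below, with \(\Sigma_{ij}\) denoting the entries of the fitted scale matrix:
\[
X + Y = (\mu_1 + \mu_2) + \sqrt{W}\,(Z_1 + Z_2), \qquad Z_1 + Z_2 \sim N\!\left(0,\ \Sigma_{11} + \Sigma_{22} + 2\Sigma_{12}\right),
\]
so the combined position is again univariate t with the same degrees of freedom, location \(\mu_1 + \mu_2\) and scale \(\sqrt{\Sigma_{11} + \Sigma_{22} + 2\Sigma_{12}}\), giving \(VaR_p(X+Y) = \mu_1 + \mu_2 + \sqrt{\Sigma_{11} + \Sigma_{22} + 2\Sigma_{12}}\;T^{-1}(1-p)\), which is exactly what the code below computes: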
newalpha=df.fit                            # degrees of freedom of the fitted bivariate t
VAR=mu.fit[1]+mu.fit[2]+sqrt(Sigma.fit[1,1]+Sigma.fit[2,2]+2*Sigma.fit[1,2])*qt(1-p1,newalpha);  # VaR of X+Y at p = 0.05
n=length(X);
Combined=X+Y;
Index=seq(1,n,by=1);
VAR.line=c(rep(VAR,n))
par(mfrow=c(1,1));
plot(Index,Combined,type="l");
points(Index,VAR.line,col="blue",type="l");
length(Combined[Combined>VAR])/n
## [1] 0.04514718
As above, we applied the log transformation to the return rate, but this time we also took first differences (to compare with Part 1, the same analysis without differencing is repeated afterwards).
In the univariate analysis we made the strong assumption that the data are independent and identically distributed, but in most cases financial data exhibit dependence. Indeed, in Part 1 we already verified the non-normality; by checking the ACF of our data and comparing it with the ACF of the squared data, we observed weak correlation in the ordinary ACF but strong correlation in the squared ACF, meaning that the data are not independent and identically distributed. Hence, a time series analysis should be more appropriate.
library(tseries)
Q=diff(X)
par(mfrow=c(1,4))
plot.ts(Q)
acf(Q);acf(Q^2);pacf(Q^2)
We then fitted an ARCH(1) model, \(Q_t = \sigma_t \epsilon_t\) with \(\sigma_t^2 = a_0 + a_1 Q_{t-1}^2\), and obtained the two coefficients \(a_0=0.0002042\) and \(a_1= 0.597489\); since \(a_1 < 1\), the model is stationary. Hence we continued the calculation and extracted the fitted values from the ARCH(1) model. The following volatility plot shows two highly volatile periods, corresponding to the 2008 financial crisis and the current COVID-19 impact.
fit.ARCH1<-garch(Q,order=c(0,1))           # ARCH(1), fitted as GARCH(0,1) via tseries
Coefficients<-fit.ARCH1$coef
alpha0=Coefficients[1]                     # a0
alpha1=Coefficients[2]                     # a1
past.prediction=fit.ARCH1$fitted.values   # fitted conditional standard deviations sigma_t
# Compare with
n=length(Q)
sigmat=past.prediction[2:n]
par(mfrow=c(1,1))
plot.ts(sigmat^2)
Without differencing:
Q=X                                        # rerun the same ARCH(1) fit on the undifferenced log returns
fit.ARCH1<-garch(Q,order=c(0,1))
Coefficients<-fit.ARCH1$coef
alpha0=Coefficients[1]
alpha1=Coefficients[2]
past.prediction=fit.ARCH1$fitted.values
n=length(Q)
sigmat=past.prediction[2:n]
par(mfrow=c(1,1))
plot.ts(sigmat^2)
Next, we performed the backtesting and calculated the Value at Risk; at both the 5% and 1% significance levels, the violation rates passed the backtesting.
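A minimal sketch of this backtesting step (the object names VaR.ARCH.* are ours; sigmat, Q, and n are as defined above, and p1 = 0.05, p2 = 0.01 as before):
# Sketch: conditional VaR_t = mean(Q) + sigma_t*z_{1-p}, assuming normal innovations as in the ARCH fit
VaR.ARCH.005 <- mean(Q) + sigmat*qnorm(1-p1)
VaR.ARCH.001 <- mean(Q) + sigmat*qnorm(1-p2)
mean(Q[2:n] > VaR.ARCH.005)                # violation rate at p = 0.05
mean(Q[2:n] > VaR.ARCH.001)                # violation rate at p = 0.01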
Hence, the prediction of the next-day Value at Risk is calculated as below:
next.VAR.005=sqrt(alpha0+alpha1*Q[n]^2)*qnorm(1-p1)+mean(Q)   # one-step-ahead VaR, p = 0.05
next.VAR.001=sqrt(alpha0+alpha1*Q[n]^2)*qnorm(1-p2)+mean(Q)   # one-step-ahead VaR, p = 0.01
next.VAR.005; next.VAR.001
## a0
## 0.1634653
## a0
## 0.2310014
To compare with the results of Part 1, we repeated the time series analysis without taking differences, as shown above. Since the data are not independent and identically distributed, the results from the time series analysis should be more appropriate.