ARPM query 2019-03-04 - Giles Heywood
Using some simple notation:
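PCA expressed as an \(R^2\) maximiser: \(\operatorname{argmax}_e \langle X.e.e' \,\|\, X \rangle\), reading \(\langle A \,\|\, B \rangle\) as the \(R^2\) between \(A\) and \(B\)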
Define gamma: \(\Gamma_{\bar{k}} = e_{1:\bar{k}}.e_{1:\bar{k}}'\)
\(X_{systematic} = X.\Gamma_{\bar{k}}\)
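As \(\bar{k}\) rises, \(\Gamma_{\bar{k}}\) acquires significant diagonal elements (its trace equals \(\bar{k}\)), so each column of \(X\) is increasingly explained by itself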
\(\Gamma_{\bar{k}}=(\Gamma_{\bar{k}}-diag(\Gamma_{\bar{k}}))+diag(\Gamma_{\bar{k}})\)
\(\Gamma_{\bar{k}}^* = (\Gamma_{\bar{k}}-diag(\Gamma_{\bar{k}})).(I-diag(\Gamma_{\bar{k}}))^{-1}\)
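One way to read the step from the split to \(\Gamma_{\bar{k}}^*\) is sketched below. It assumes the underlying model is the column-wise regression \(X = X.\Gamma_{\bar{k}} + U\); the residual \(U\) and the shorthand \(D = diag(\Gamma_{\bar{k}})\) are notation introduced here for illustration, not taken from the original.
\(X = X.(\Gamma_{\bar{k}}-D) + X.D + U\)
\(X.(I-D) = X.(\Gamma_{\bar{k}}-D) + U\)
\(X = X.(\Gamma_{\bar{k}}-D).(I-D)^{-1} + U.(I-D)^{-1} = X.\Gamma_{\bar{k}}^* + U.(I-D)^{-1}\)
The term \(X.D\) is each column's contribution to its own fit, so moving it to the left-hand side removes it from the regressors. The last step needs every diagonal element of \(\Gamma_{\bar{k}}\) to be strictly below 1 so that \(I-D\) is invertible.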
\(\Gamma_{\bar{k}}^*\) has zero diagonal, so no column of \(X\) appears on both sides of the regression and the \(R^2\) test is better specified
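The two properties used above (the trace of gamma equals the number of factors, and the starred matrix has zero diagonal) can be checked numerically. This is a minimal sketch on synthetic Gaussian data, which is an assumption made here so that it runs without the S&P file; the variable names are illustrative.

```r
set.seed(1)
# synthetic data (assumption): 200 observations of 10 series
x <- matrix(rnorm(200 * 10), 200, 10)
e <- eigen(cov(x))$vectors
k <- 4
ek <- e[, 1:k, drop = FALSE]
gamma_k <- ek %*% t(ek)                      # projector onto the first k eigenvectors
d <- diag(diag(gamma_k))                     # diagonal part of gamma
gamma_star <- (gamma_k - d) %*% solve(diag(ncol(x)) - d)
trace_gamma <- sum(diag(gamma_k))            # trace of a rank-k projector
max_abs_diag_star <- max(abs(diag(gamma_star)))
print(c(trace = trace_gamma, max_diag_star = max_abs_diag_star))
```

Because the rescaling matrix is diagonal, right-multiplying by it only rescales columns and leaves the zeroed diagonal at zero.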
suppressWarnings(library(data.table))
library(ggplot2)
#read data
x1 <- 'C:/Users/Giles/Documents/ARPM/databases/global-databases/equities/db_stocks_sp.csv'
x2 <- data.table(read.csv(x1,skip=1))[,-1]
#calculate log returns
x3 <- lapply(x2[,which(!sapply(x2,anyNA)),with=FALSE],function(p) diff(log(as.numeric(p))))
x4 <- as.matrix(as.data.table(x3))[-1,]
#principal components
x6 <- eigen(cov(x4))
n <- ncol(x4)
#r-squared(k)
kmax <- 80
x9 <- matrix(0,kmax,2)
for(k in 1:kmax) {
  #projector onto the first k eigenvectors
  ek <- x6$vectors[,1:k,drop=FALSE]
  gamma <- ek%*%t(ek)
  #raw: pooled regression of X on X.gamma
  x7 <- data.table(x=as.numeric(x4%*%gamma),y=as.numeric(x4))
  x9[k,1] <- summary(lm(y~x,data=x7))$r.squared
  #shift from RHS to LHS: zero the diagonal and rescale
  gammax <- (gamma-diag(diag(gamma)))%*%solve(diag(n)-diag(diag(gamma)))
  x8 <- data.table(x=as.numeric(x4%*%gammax),y=as.numeric(x4))
  x9[k,2] <- summary(lm(y~x,data=x8))$r.squared
}
#munge and plot
x10 <- melt(
  setnames(data.table(x9),c('raw','de-diag'))[,nfactors:=1:.N],
  measure.vars=c('raw','de-diag')
)[,vbl:=as.factor(variable)]
ggplot(x10,aes(nfactors,value,fill=vbl))+
  geom_bar(stat='identity',position=position_dodge())+
  ylab('R-squared')+
  ggtitle(paste0('max at k=',which.max(x9[,2])))
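The calculation above can be tried without the S&P file. The sketch below wraps one step in a hypothetical helper, r2_pair (not part of the original), and runs it on a synthetic three-factor panel, which is an assumption made here. It uses the fact that the R-squared of a one-regressor lm with intercept equals the squared correlation, and cross-checks that against lm for one value of k.

```r
# Hypothetical helper: raw and de-diagonalised R-squared for a given k.
r2_pair <- function(x, e, k) {
  n <- ncol(x)
  ek <- e[, 1:k, drop = FALSE]
  gamma <- ek %*% t(ek)
  d <- diag(diag(gamma))
  gammax <- (gamma - d) %*% solve(diag(n) - d)
  y <- as.numeric(x)
  # squared correlation equals the R-squared of lm(y ~ x) with one regressor
  c(raw = cor(as.numeric(x %*% gamma), y)^2,
    dediag = cor(as.numeric(x %*% gammax), y)^2)
}

set.seed(42)
# synthetic panel (assumption): 3 latent factors, 12 series, 300 observations
f <- matrix(rnorm(300 * 3), 300, 3)
b <- matrix(rnorm(3 * 12), 3, 12)
x <- f %*% b + matrix(rnorm(300 * 12), 300, 12)
e <- eigen(cov(x))$vectors
res <- t(sapply(1:8, function(k) r2_pair(x, e, k)))

# cross-check the raw figure at k = 3 against lm, as used in the original loop
g3 <- e[, 1:3] %*% t(e[, 1:3])
chk <- summary(lm(y ~ z, data = data.frame(z = as.numeric(x %*% g3),
                                           y = as.numeric(x))))$r.squared
print(round(res, 3))
```

Replacing lm with cor is only a speed convenience; it does not change the quantity being compared across k.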
Although this de-diagonalised \(R^2\) has no peak in this example, it shows a practical cutoff around \(\bar{k}\approx20\) that is better defined than in the raw version. It could also be adjusted for degrees of freedom or tested parametrically, e.g. with a t-statistic.
This is for order determination, not for rank reduction.
It could be combined with cross-validation, i.e. holding out rows \(X_t\) in leave-one-out or k-fold fashion, but the de-diagonalisation is independent of that.
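As an illustration of how the two ideas might be combined, here is a minimal k-fold sketch. It assumes a synthetic three-factor panel and five random folds of rows, both chosen here for illustration; the eigenvectors come from the training rows only and the de-diagonalised R-squared is scored on the held-out rows. It is one possible arrangement, not the method used in the code above.

```r
set.seed(7)
# synthetic panel (assumption): 3 latent factors, 12 series, 300 observations
f <- matrix(rnorm(300 * 3), 300, 3)
b <- matrix(rnorm(3 * 12), 3, 12)
x <- f %*% b + matrix(rnorm(300 * 12), 300, 12)
n <- ncol(x)
nfold <- 5
kmax <- 8
fold <- sample(rep(1:nfold, length.out = nrow(x)))   # random fold label per row
cv <- matrix(0, nfold, kmax)
for (j in 1:nfold) {
  train <- x[fold != j, , drop = FALSE]
  test <- x[fold == j, , drop = FALSE]
  e <- eigen(cov(train))$vectors                     # estimated on training rows only
  for (k in 1:kmax) {
    ek <- e[, 1:k, drop = FALSE]
    gamma <- ek %*% t(ek)
    d <- diag(diag(gamma))
    gammax <- (gamma - d) %*% solve(diag(n) - d)
    # de-diagonalised R-squared on the held-out rows
    cv[j, k] <- cor(as.numeric(test %*% gammax), as.numeric(test))^2
  }
}
cv_mean <- colMeans(cv)                              # average over folds, one value per k
print(round(cv_mean, 3))
```

Leave-one-out would be the same loop with one row per fold, at the cost of one eigendecomposition per observation.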
Is this a standard method?
Is it reasonable?