An adjustment to Principal Components

Using some simple notation:

PCA expressed as an \(R^2\) maximiser: \(e = \arg\max_{e} R^2\langle X.e.e' \,\|\, X\rangle\), i.e. the \(R^2\) from regressing \(X\) on \(X.e.e'\)

Define \(\Gamma_{\bar{k}} = e_{1:\bar{k}}.e_{1:\bar{k}}'\), the projector onto the first \(\bar{k}\) eigenvectors

\(X_{systematic} = X.\Gamma_{\bar{k}}\)
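As a quick check of the notation, the sketch below builds \(\Gamma_{\bar{k}}\) from the leading eigenvectors of a sample covariance matrix and confirms that it behaves as a projector. The simulated matrix `X` and the helper `make_gamma` are illustrative assumptions, not the equity data used further down.

```r
set.seed(1)
# Simulated returns: 200 observations on 10 series (assumption, illustration only)
X <- matrix(rnorm(200 * 10), 200, 10)
e <- eigen(cov(X))$vectors

# Hypothetical helper: projector onto the first k eigenvectors
make_gamma <- function(e, k) {
  ek <- e[, 1:k, drop = FALSE]
  ek %*% t(ek)
}

k <- 3
gamma <- make_gamma(e, k)
X_systematic <- X %*% gamma

# gamma is symmetric and idempotent, and its trace equals k
is_symmetric  <- isTRUE(all.equal(gamma, t(gamma)))
is_idempotent <- isTRUE(all.equal(gamma %*% gamma, gamma))
trace_gamma   <- sum(diag(gamma))
```

Because the trace equals \(\bar{k}\), the average diagonal element of \(\Gamma_{\bar{k}}\) is \(\bar{k}/n\) for \(n\) series.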

Problem

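As \(\bar{k}\) rises, \(\Gamma_{\bar{k}}\) has significant diagonal elements

The RHS of \(X \approx X.\Gamma_{\bar{k}}\) carries increasing non-systematic risk
\(\Gamma_{\bar{k}}\) is ‘becoming diagonal’ (it equals \(I\) when \(\bar{k}\) reaches the number of series)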
\(R^2\) rises monotonically
No obvious way to choose \(\bar{k}\) other than cross-validation
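The points above can be illustrated with a minimal sketch on simulated data (one common factor plus noise, an assumption made purely for illustration). `X` is centred here so that the stacked-regression \(R^2\) equals the share of variance captured by the first \(\bar{k}\) eigenvalues exactly.

```r
set.seed(1)
n <- 10
# Simulated returns: one common factor plus idiosyncratic noise (assumption)
f <- rnorm(500)
X <- outer(f, runif(n, 0.5, 1.5)) + matrix(rnorm(500 * n), 500, n)
X <- scale(X, center = TRUE, scale = FALSE)
e <- eigen(cov(X))$vectors

mean_diag <- numeric(n)
r2 <- numeric(n)
for (k in 1:n) {
  ek <- e[, 1:k, drop = FALSE]
  gamma <- ek %*% t(ek)
  mean_diag[k] <- mean(diag(gamma))   # equals k / n because trace(gamma) = k
  # squared correlation of the stacked values, the same R^2 as lm(y ~ x)
  r2[k] <- cor(as.numeric(X %*% gamma), as.numeric(X))^2
}
```

Here `mean_diag` climbs linearly to 1 and `r2` never decreases, so the raw \(R^2\) alone gives no stopping point.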

Alternative to cross-validation?

\(\Gamma_{\bar{k}}=(\Gamma_{\bar{k}}-diag(\Gamma_{\bar{k}}))+diag(\Gamma_{\bar{k}})\)

\(\Gamma_{\bar{k}}^* = (\Gamma_{\bar{k}}-diag(\Gamma_{\bar{k}})).(I-diag(\Gamma_{\bar{k}}))^{-1}\)
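One way to read this definition, consistent with the 'shift from RHS to LHS' comment in the code further down, is as moving each series' own contribution to the left-hand side. Writing \(D = diag(\Gamma_{\bar{k}})\) (a shorthand introduced here) and assuming every \(\Gamma_{ii} < 1\):

\[
X \approx X.\Gamma_{\bar{k}} = X.(\Gamma_{\bar{k}} - D) + X.D
\;\Rightarrow\;
X.(I - D) \approx X.(\Gamma_{\bar{k}} - D)
\;\Rightarrow\;
X \approx X.(\Gamma_{\bar{k}} - D).(I - D)^{-1} = X.\Gamma_{\bar{k}}^*
\]

Column by column, this says \(x_i \approx \sum_{j \neq i} x_j \Gamma_{ji} / (1 - \Gamma_{ii})\), so each series is explained only by the other series.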

\(\Gamma_{\bar{k}}^*\) has zero diagonal, so no series is used to explain itself and the \(R^2\) test is better specified
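A minimal sketch on simulated data (an assumption, not the equity data below) confirms the zero diagonal. It also shows that the matrix inverse is only a column rescaling, since \(I - diag(\Gamma_{\bar{k}})\) is diagonal. The name `gammax_fast` is illustrative.

```r
set.seed(1)
n <- 8
X <- matrix(rnorm(300 * n), 300, n)   # simulated data (assumption)
e <- eigen(cov(X))$vectors
ek <- e[, 1:3, drop = FALSE]
gamma <- ek %*% t(ek)
d <- diag(gamma)

# Off-diagonal part of gamma, with column i rescaled by 1 / (1 - gamma_ii)
gammax <- (gamma - diag(d)) %*% solve(diag(n) - diag(d))

# Same result without a matrix inverse: divide each column by (1 - d)
gammax_fast <- sweep(gamma - diag(d), 2, 1 - d, "/")

max_abs_diag <- max(abs(diag(gammax)))
```

The rescaling breaks down if any diagonal element of \(\Gamma_{\bar{k}}\) equals 1, which happens once \(\bar{k}\) reaches the number of series.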

Equity example in R

suppressWarnings(library(data.table))
library(ggplot2)

#read data
x1 <- 'C:/Users/Giles/Documents/ARPM/databases/global-databases/equities/db_stocks_sp.csv'
x2 <- data.table(read.csv(x1,skip=1))[,-1]

#calculate log returns, keeping only columns with no missing values
x3 <- lapply(x2[,which(!sapply(x2,anyNA)),with=FALSE],function(p) diff(log(as.numeric(p))))
x4 <- as.matrix(as.data.table(x3))[-1,]

#principal components
x6 <- eigen(cov(x4))
n <- ncol(x4)

#r-squared(k)
kmax <- 80
x9 <- matrix(0,kmax,2)
for(k in 1:kmax) {
  ek <- x6$vectors[,1:k,drop=FALSE] #first k eigenvectors, kept as a matrix
  gamma <- ek%*%t(ek)
  x7 <- data.table(x=as.numeric(x4%*%gamma),y=as.numeric(x4))
  x9[k,1] <- summary(lm(y~x,data=x7))$r.squared
  
  #shift the diagonal (own-series) term from RHS to LHS
  gammax <- (gamma-diag(diag(gamma)))%*%solve(diag(n)-diag(diag(gamma)))
  x8 <- data.table(x=as.numeric(x4%*%gammax),y=as.numeric(x4))
  x9[k,2] <- summary(lm(y~x,data=x8))$r.squared
}

#munge and plot
x10 <- melt(setnames(data.table(x9),c('raw','de-diag'))[,nfactors:=1:.N],measure.vars=c('raw','de-diag'))[,vbl:=as.factor(variable)]
ggplot(x10,aes(nfactors,value,fill=vbl))+
  geom_bar(stat='identity',position=position_dodge())+
  ylab('R-squared')+
  ggtitle(paste0('max at k=',which.max(x9[,2])))

Although the de-diagonalised \(R^2\) does not show a clear peak in this example, there is a practical cutoff around \(\bar{k}\approx20\) that is better defined than in the raw version. It could also be adjusted for degrees of freedom or tested parametrically, e.g. with a t-statistic.

This is for order determination, not for rank reduction.

It could be combined with cross-validation, i.e. holding out rows \(X_t\) in leave-one-out or k-fold. The adjustment itself is independent of that.
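As a rough sketch of what that combination might look like, the code below estimates the eigenvectors on the training rows only and scores the de-diagonalised \(R^2\) on the held-out rows. Everything here is an assumption for illustration: the simulated two-factor data, the choice of 5 folds, and the names `fold`, `r2` and `cv_r2`.

```r
set.seed(1)
n <- 10
nobs <- 400
# Simulated returns: two common factors plus noise (assumption)
f <- matrix(rnorm(nobs * 2), nobs, 2)
X <- f %*% matrix(runif(2 * n, 0.5, 1.5), 2, n) + matrix(rnorm(nobs * n), nobs, n)

nfold <- 5
fold <- sample(rep(1:nfold, length.out = nobs))   # random fold label per row
kmax <- n - 1                                     # k = n would make 1 - diag(gamma) zero
r2 <- matrix(NA_real_, nfold, kmax)
for (j in 1:nfold) {
  train <- X[fold != j, , drop = FALSE]
  test  <- X[fold == j, , drop = FALSE]
  e <- eigen(cov(train))$vectors                  # eigenvectors from training rows only
  for (k in 1:kmax) {
    ek <- e[, 1:k, drop = FALSE]
    gamma <- ek %*% t(ek)
    d <- diag(gamma)
    gammax <- sweep(gamma - diag(d), 2, 1 - d, "/")   # de-diagonalised gamma
    r2[j, k] <- cor(as.numeric(test %*% gammax), as.numeric(test))^2
  }
}
cv_r2 <- colMeans(r2)   # held-out R^2 by number of factors
```

Leave-one-out would correspond to setting `nfold` equal to the number of rows, though each single-row test set then gives a much noisier \(R^2\).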

Question

Is this a standard method?

Is it reasonable?