Non-robustness issues in estimating the number of factors in high-dimensional data

Introduction

On this page, we first provide a systematic review of, and explanations for, the R code used in currently popular methods for choosing the number of strong and weak factors in high-dimensional data. Examples of how to apply those methods in practice are included. Specifically, the methods we review are the IC, ER, ED, NE, and BCV methods, covered in Sections 1-5 below.

Most methods for choosing the number of factors are based on results from random matrix theory (RMT), which studies the distribution of sample eigenvalues and requires i.i.d. and Gaussian assumptions on the error terms. These restrictions may not be appropriate in practice. On this page, we also show by simulation that the methods we review are not robust when the error terms in the factor model are serially or cross-sectionally correlated, or have non-Gaussian distributions.

Basic factor model

Factor analysis is based on a model that separates the observed data into an unobserved systematic part (the signal) and an unobserved error part (the noise). The systematic part captures the main information in the data, so we want to separate it from the noise part. Specifically, let \(Y_{it}\) be the observed data for the \(i\)-th cross-section unit at time \(t\), for \(i=1,2,\cdots, N\) and \(t=1,\cdots,T\). The factor model for \(Y_{it}\) is given by \[\begin{equation} Y_{it} = L_{i}^{'} F_{t} + e_{it}, \; i = 1,\dots,N, \; t = 1,\dots, T, \quad \quad \quad (1) \end{equation}\] where \(F_{t}\) is a \((r_{0} \times 1)\) vector of common factors, \(L_{i}\) is a \((r_{0} \times 1)\) vector of loadings associated with \(F_{t}\), and \(e_{it}\) is the idiosyncratic component (the noise part) of \(Y_{it}\). The number of true factors in the model is \(r_{0}\), and the product \(L_{i}^{'}F_{t}\) is called the common component (the signal part) of \(Y_{it}\). The factors, their loadings, and the idiosyncratic errors are not observable. The goal is to estimate the number of factors \(r_{0}\) in the observed data under strong and weak factor assumptions.

Strong and weak factor assumptions

In this section, we briefly review the strong and weak factor assumptions. Assuming the factors \(F_{t}\) and the noise \(e_{t}\) are uncorrelated and have zero mean, and imposing the normalization \(\mathbb{E}(F_{t}F_{t}^{'}) = I_{r_{0}}\) for identification, the population covariance matrix of the factor model (1) can be expressed as \[\begin{equation} \Sigma_{Y} = LL^{'} + \Sigma_{e}, \quad \quad \quad (2) \end{equation}\] where \(\Sigma_{Y}\) and \(\Sigma_{e}\) are the \(N \times N\) population covariance matrices of \(Y_{t}\) and \(e_{t}\), respectively.

Strong factor assumption: For (2), we assume that \(L^{'}L/N \rightarrow \Sigma_{L}\) for some \(r_{0} \times r_{0}\) positive definite matrix \(\Sigma_{L}\), and that all eigenvalues of \(\Sigma_{e}\) are bounded as \(N,T \rightarrow \infty\).

Under this assumption, the top \(r_{0}\) eigenvalues of \(\Sigma_{Y}\) diverge at rate \(O(N)\) while its remaining eigenvalues stay bounded as \(N, T \rightarrow \infty\). This is the critical condition that allows the methods below to consistently estimate the number of strong factors as \(N, T \rightarrow \infty\).

Weak factor assumption: In contrast to strong factors, for weak factors we assume that \(L^{'}L \rightarrow \Sigma_{L}\) instead of \(L^{'}L/N \rightarrow \Sigma_{L}\), with all eigenvalues of \(\Sigma_{e}\) bounded as \(N,T \rightarrow \infty\).

Under this assumption, all eigenvalues of \(\Sigma_{Y}\) stay bounded as \(N,T \rightarrow \infty\), and the PCA or MLE estimators of the factors and corresponding loadings are not consistent.

Generating weak factors in the DGP

Before introducing the methods to estimate the number of strong and weak factors in high-dimensional data, let us first describe how to generate the weak factors in the DGP and the corresponding R code. Consider the factor model: \[ Y = X + \Sigma^{\frac{1}{2}}E = \Sigma^{\frac{1}{2}} (\sqrt{T} \hat{U} \hat{D} \hat{V}' + E), \]

where \(\sqrt{T} \hat{U} \hat{D} \hat{V}'\) is the singular value decomposition (SVD) of \(\Sigma^{-\frac{1}{2}}X\) with \(\hat{U} \in \mathbb{R}^{N\times \min (N,T)}\), \(\hat{V} \in \mathbb{R}^{T \times \min (N,T)}\), \(\hat{D}= \text{diag} (\hat{d}_{1}, \hat{d}_{2}, \cdots, \hat{d}_{\min (N, T)})\), \(\hat{U}' \hat{U} = \hat{V}' \hat{V} = I_{\min (N, T)}\), and \(\hat{d}_{1} \geq \hat{d}_{2} \geq \cdots \geq \hat{d}_{\min (N, T)}\).

For the weighted signal matrix \(\Sigma^{-\frac{1}{2}}X = \sqrt{T} \hat{U} \hat{D} \hat{V}'\), we can generate factors of different strengths by specifying the entries of \(\hat{D}\). Specifically, in our example DGP we generate one strong factor, two useful weak factors, one harmful weak factor, and one undetectable weak factor, following the thresholds below:

  • Undetectable factor: \(d^{2}_{i} < \mu_{F}\); the factor is asymptotically undetectable by PCA- or MLE-based methods.

  • Harmful weak factor: \(\mu_{F} < d^{2}_{i} < \mu_{F}^{*}\); including the factor in the model increases the loss for recovering the signal matrix \(X\).

  • Useful weak factor: \(\mu_{F}^{*} < d^{2}_{i} = O(1)\); including the factor reduces the loss for recovering the signal matrix \(X\).

  • Strong factor: \(d^{2}_{i}\) grows proportionally to \(N\).

where \(\mu_{F}=\sqrt{\gamma}\) and \[\mu_{F}^{*}=\frac{1+\gamma}{2}+\sqrt{\left(\frac{1+\gamma}{2}\right)^{2}+3 \gamma}, \quad \text{for} \quad \frac{N}{T} \rightarrow \gamma\]

are the detection and estimation thresholds based on results from random matrix theory (RMT). For the error term \(\Sigma^{\frac{1}{2}}E\) in the factor model, we assume homoscedastic noise, \(\Sigma= I_{N}\) and \(E=\left(e_{i t}\right)_{N \times T}\) with \(e_{it} \stackrel{\text { iid }}{\sim} \mathcal{N}(0,1)\), for simplicity. Based on the factor model and the thresholds above, we can write the R code for our DGP as follows:

Function for generating the orthogonal matrices \(U\) and \(V\):
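(The original helper is not shown on this page; the following is a minimal sketch under a hypothetical name, gen_orth(), which orthonormalizes the columns of a Gaussian random matrix via a QR decomposition.)

gen_orth <- function(n, k) {
  # QR decomposition of an n x k standard Gaussian matrix yields a matrix
  # with orthonormal columns, so t(Q) %*% Q = I_k
  qr.Q(qr(matrix(rnorm(n * k), n, k)))
}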

Example: we can use the R code above to generate a signal matrix \(X\) with \(N=T=50\) and five factors: one strong factor, two useful weak factors, one harmful weak factor, and one undetectable weak factor. The printed matrix collects the squared strengths \(d_{i}^{2}\):
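(A minimal sketch of the call, assuming the hypothetical gen_orth() helper above; the squared strengths are chosen so that, with \(\gamma = N/T = 1\), the thresholds are \(\mu_F = 1\) and \(\mu_F^{*} = 3\).)

set.seed(2020)
N <- 50; T <- 50
d2 <- c(1.5*N, 7.5, 4.5, 2, 0.5)   # strong, useful, useful, harmful, undetectable
U <- gen_orth(N, 5)
V <- gen_orth(T, 5)
X <- sqrt(T) * U %*% diag(sqrt(d2)) %*% t(V)   # signal matrix (Sigma = I_N)
diag(d2)                           # squared strengths d_i^2, printed below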

     [,1] [,2] [,3] [,4] [,5]
[1,]   75  0.0  0.0    0  0.0
[2,]    0  7.5  0.0    0  0.0
[3,]    0  0.0  4.5    0  0.0
[4,]    0  0.0  0.0    2  0.0
[5,]    0  0.0  0.0    0  0.5

1. IC Method

Let \(\hat{L}_{r}^{\mathrm{pc}}\) and \(\hat{F}_{r}^{\mathrm{pc}}\) be the PCA estimators of the loadings and factors in the factor model. Define \[\begin{equation} \label{eq:3.2} V(r)=\frac{1}{N T}\left\|Y-\hat{L}_{r}^{\mathrm{pc}} \hat{F}_{r}^{\mathrm{pc}\prime}\right\|_{F}^{2}, \end{equation}\] and the following loss function: \[\begin{equation} \label{eq:3.3} \mathrm{IC}(r) = V(r) + rg(N,T) \quad \text{or} \quad \log (V(r)) + rg(N,T). \end{equation}\] The penalty function \(g(N,T)\) satisfies two conditions:

  • \(g(N, T) \rightarrow 0\),
  • \(C_{NT}^2 g(N,T)\) \(\rightarrow \infty\),

as \(N, T \rightarrow \infty\), where \(C_{NT} = \min \{\sqrt{N}, \sqrt{T}\}\).

Define the estimator for the number of factors as

\[\hat{r}_{\mathrm{IC}}=\operatorname{argmin}_{0 \leq r \leq r_{max} } \mathrm{IC}(r).\]

Then consistency, \(\hat{r}_{\mathrm{IC}} \overset{p}{\rightarrow} r_{0}\) as \(N,T \rightarrow \infty\), can be established under the strong factor assumption.

Depending on the form of the penalty function \(g(N,T)\), the loss function \(IC(r)\) yields different criteria; specifically:

  • \(IC1(r) = \ln(V(r)) + r\;(\frac{N+T}{NT})\;\ln(\frac{NT}{N+T})\);
  • \(IC2(r) = \ln(V(r)) + r\;(\frac{N+T}{NT})\;\ln C_{NT}^2\);
  • \(IC3(r) = \ln(V(r)) + r\;(\frac{\ln C_{NT}^2}{C_{NT}^2})\);
  • \(AIC3(r) = V(r) + r\; \hat{\sigma}^2\;(\frac{2(N+T -r)}{NT})\);
  • \(BIC3(r) = V(r) + r\; \hat{\sigma}^2\;(\frac{(N+T -r)\ln(NT)}{NT})\).

Based on these forms, we can write the R code for the IC method as follows:

IC <- function (X, rmax = NULL, criteria = NULL) {   # set up a function
  
  all(sapply(X, is.numeric) == TRUE) || stop("All columns must be numeric")
  
  is.xts(X) || stop("X must be an xts object so lags and differences are taken properly")
  
  if (is.null(rmax) || rmax%%1 != 0) {
    stop("rmax must be an integer.")
  }
  
  if (is.null(criteria)) {
    warning("criteria is NULL, setting criteria to BIC3")
    criteria <- "BIC3"
  }
  
  T <- dim(X)[1]  # number of rows (time periods) of the input matrix X
  N <- dim(X)[2]  # number of columns (cross-section units) of the input matrix X
  NT <- N * T
  NT1 <- N + T 
  
  # construct an empty matrix for the values generated by the penalty function
  CT <- matrix(0, 1, rmax) 
  
  R <- 1:rmax        # sequence of candidate r values
  C_NT <- min(N, T)  # note: this stores C_NT^2 = min(N, T) from the formulas above
  
  # different criteria corresponding to the different penalty forms
  if (criteria == "IC1") {      
    CT[1, ] <- log(NT/NT1) * R * NT1/NT
  }
  if (criteria == "IC2") {
    CT[1, ] <- (NT1/NT) * log(C_NT) * R
  }
  if (criteria == "IC3") {
    CT[1, ] <- R * log(C_NT)/C_NT
  }
  if (criteria == "AIC3") {
    CT[1, ] <- 2 * R * (NT1-R)/NT
  }
  if (criteria == "BIC3") {
    CT[1, ] <- log(NT) * R * (NT1-R)/NT
  }
  
  # construct an empty matrix for the values of the loss function IC(r)
  ic <- matrix(0, 1, rmax + 1) 
  
  # construct an empty matrix for the values of the loss function V(r)
  V <- matrix(0, 1, rmax + 1)  
  
  eig_S <- eigen(t(X)%*%X)          # eigen-decomposition of the N x N matrix X'X
  eig <- eig_S$values               # corresponding eigenvalues
  eig_vec0 <- eig_S$vectors         # corresponding eigenvectors
  
  # try each candidate r in the loss function IC(r)
  for (i in 1: rmax) {
    
    F_hat <- sqrt(T)*eig_vec0[, 1:i]    # scaled leading eigenvectors of X'X (N x i)
    L_hat <- X%*%(F_hat)/T              # corresponding PCA scores (T x i)
    
    e_hat <- X - L_hat%*%t(F_hat)       # residuals of the rank-i PCA fit
    
    V[i] <- sum(e_hat^2)/(N*T)          # average value of the loss function V(r)
    ic[,i] <- log(V[i]) + CT[, i]       # value of the loss function IC(r)
  }
  
  V[rmax + 1] <- sum(X^2)/(N*T)         # r = 0: no factors removed
  ic[,rmax + 1] <- log(V[rmax + 1])
  
  # select the r that minimizes the loss function IC(r)
  ic_hat <- which.min(ic)    
  ic_hat <- ifelse(ic_hat <= rmax, ic_hat, 0)  # slot rmax + 1 corresponds to r = 0
  
  # note: L_hat and F_hat below are from the largest fit (r = rmax)
  output <- list(ic_hat = ic_hat, L_hat = L_hat, F_hat = F_hat)
  return(output)
}

The only package we need to load before applying this function is:
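(The IC() wrapper above checks is.xts(X), and the simulation code below calls xts(); both come from the xts package.)

library(xts)   # provides xts() and is.xts(); loads its dependency zoo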

Simulation design: consider a simple DGP for generating strong factors with different types of noise:

\[\begin{equation} \begin{split} &Y_{it} = \sum_{j=1}^{r}\lambda_{ij} F_{tj} + e_{it}, \quad \text{where}\\ & \lambda_{ij}, F_{tj} \stackrel{\text { iid }}{\sim} \mathcal{N}(0,1),\\ &e_{i t} = \rho_{1} e_{i t-1} + (1-\rho_{1}^2)^{1/2} \xi_{it},\\ &\xi_{i t} = \rho_{2} \xi_{i-1, t} + (1-\rho_{2}^2)^{1/2} \epsilon_{it}, \quad \epsilon_{it} \stackrel{\text { iid }}{\sim} \mathcal{N}(0,1). \end{split} \end{equation}\]

We let \(r=5\), and consider the three cases for \(e_{it}\) below:

  • Case I: high serial correlation only, \(\rho_{1} = 0.8\) and \(\rho_{2} = 0\);

  • Case II: high cross-sectional correlation only, \(\rho_{1} = 0\) and \(\rho_{2} = 0.8\);

  • Case III: non-Gaussian distribution only, \(\rho_{1} = \rho_{2} = 0\). For example, we can choose a gamma distribution for \(e_{it}\); the code below uses shape 0.25 and scale 4.

The corresponding R code of this DGP is:

S <- 10                   # number of simulation replications
N_set <- c(50, 100)
T_set <- c(50, 100, 200)
NT_comb <- expand.grid(N_set, T_set)  # Combination of N and T pairs

r_hat_IC <- matrix(NA, S, 4)

Factor_IC <- matrix(NA, nrow(NT_comb), 6)
colnames(Factor_IC) <- c("N","T", "white", "serial", "cross", "gamma")

r <- 5                    # set the true number of factors

for(i in 1:nrow(NT_comb)){
  for(s in 1:S){
    
    # Data generating process
    N <- NT_comb[i,1]
    T <- NT_comb[i,2]
    
    F_0 <-  matrix(rnorm(T*r), T, r)  # Generating factor matrix
    L_0 <-  matrix(rnorm(N*r), N, r)  # Generating loading matrix
    
    # DGP for generating strong factors in white noise
    e_0 <-  matrix(rnorm(N*T), N, T)  
    X <- L_0%*%t(F_0) + e_0   
    
    # DGP for generating strong factors in highly serially correlated noise
    rho1 <- 0.8
    e_1 <- matrix(NA, N, T)
    e_1[, 1] <- rnorm(N, 0, 1)
    
    for (t in 1:(T-1)) {
      e_1[, t+1] <- e_1[, t]*rho1 + sqrt(1 - rho1^2)*rnorm(N, 0, 1)
    }
    
    X1 <- L_0%*%t(F_0) + e_1
    

    # DGP for generating strong factors in highly cross-sectionally correlated noise
    rho2 <- 0.8
    e_2 <- matrix(NA, N, T)
    e_2[1,] <- rnorm(T, 0, 1)
    
    for (n in 1:(N-1)) {
      e_2[n+1 , ] <- e_2[n, ]*rho2 + sqrt(1 - rho2^2)*rnorm(T, 0, 1)
    }
  
    X2 <- L_0%*%t(F_0) + e_2
    
    # DGP for generating strong factors in noise with a non-Gaussian distribution;
    # take gamma-distributed noise as an example
    e_3 <-  matrix(rgamma(N * T, 0.25, scale = 4), nrow = N)
    X3 <- L_0%*%t(F_0) + e_3  
    
    # Apply any one of the criteria, such as "IC2", to estimate 
    # the number of strong factors in the DGP:    
    d <- as.Date(1:nrow(X))
    X.xts <- xts(X, order.by = d)
    r_hat_IC[s, 1] <- IC(X.xts, rmax = 16, "IC2")$ic_hat
    
    d <- as.Date(1:nrow(X1))
    X1.xts <- xts(X1, order.by = d)
    r_hat_IC[s, 2] <- IC(X1.xts, rmax = 16, "IC2")$ic_hat
    
    d <- as.Date(1:nrow(X2))
    X2.xts <- xts(X2, order.by = d)
    r_hat_IC[s, 3] <- IC(X2.xts, rmax = 16, "IC2")$ic_hat
    
    d <- as.Date(1:nrow(X3))
    X3.xts <- xts(X3, order.by = d)
    r_hat_IC[s, 4] <- IC(X3.xts, rmax = 16, "IC2")$ic_hat
    
  }
  Factor_IC[i,] <- c(N, T, colMeans(r_hat_IC))
}

The results of the IC method using the “IC2” criterion to estimate the number of strong factors with different types of error terms in the DGP are:

       N   T white serial cross gamma
[1,]  50  50     5   13.0  12.8   5.6
[2,] 100  50     5   15.1  12.8   6.0
[3,]  50 100     5   12.6  15.3   6.0
[4,] 100 100     5   14.9  14.6   6.0
[5,]  50 200     5    5.8  16.0   6.0
[6,] 100 200     5    9.1  16.0   6.0

From the results above, we can see that the IC method is not robust when the error terms in the factor model are highly serially or cross-sectionally correlated, or follow a gamma distribution.

Besides, the IC method is designed to estimate the number of factors under the strong factor assumption, so it may fail to detect weak factors in the data. To show this, let’s apply the IC method to the DGP that we used to generate weak factors, as sketched below:
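(The original simulation code is not shown on this page; the following minimal sketch reuses S, NT_comb, and the hypothetical gen_orth() helper from above, and sets the strong factor’s squared strength to \(1.5N\), consistent with the printed example.)

r_hat_IC_w <- matrix(NA, S, 1)
Factor_IC_w <- matrix(NA, nrow(NT_comb), 3)
colnames(Factor_IC_w) <- c("N", "T", "r_hat_IC")

for (i in 1:nrow(NT_comb)) {
  for (s in 1:S) {
    N <- NT_comb[i, 1]
    T <- NT_comb[i, 2]
    d2 <- c(1.5*N, 7.5, 4.5, 2, 0.5)                # one strong + four weak factors
    U <- gen_orth(N, 5)
    V <- gen_orth(T, 5)
    X_w <- sqrt(T) * U %*% diag(sqrt(d2)) %*% t(V)  # N x T signal matrix
    Y_w <- t(X_w) + matrix(rnorm(N*T), T, N)        # T x N data matrix, white noise
    Y.xts <- xts(Y_w, order.by = as.Date(1:T))
    r_hat_IC_w[s, 1] <- IC(Y.xts, rmax = 16, "IC2")$ic_hat
  }
  Factor_IC_w[i, ] <- c(N, T, colMeans(r_hat_IC_w))
}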

The goal is to estimate the number of strong and useful weak factors in the DGP (three in total: one strong and two useful weak factors). The results of the IC method using the “IC2” criterion to estimate the number of weak factors in the DGP are:

       N   T r_hat_IC
[1,]  50  50      2.8
[2,] 100  50      2.3
[3,]  50 100      2.1
[4,] 100 100      2.0
[5,]  50 200      2.0
[6,] 100 200      1.8

The results above show that the IC method estimates the number of strong and useful weak factors in the DGP reasonably well in finite samples, although it is not designed to estimate the number of weak factors.

2. ER Method

The ER estimator is defined as

\[ \hat{r}_{\mathrm{ER}}=\operatorname{argmax}_{0 \leq r \leq r_{max}} \lambda_{r}/\lambda_{r+1},\] where \(\lambda_{r}\) is the \(r\)-th largest sample eigenvalue, with the mock eigenvalue \(\lambda_{0} = \sum_{r=1}^{\min(N,T)}\lambda_{r}/ \log \min(N,T)\).

The intuition for this method is simple: under the strong factor assumption, for any \(j \neq r_{0}\) the ratio \(\lambda_{j}/ \lambda_{j+1}\) stays bounded (it is \(O(1)\)) as \(N, T \rightarrow \infty\), while the ratio \(\lambda_{r_{0}}/\lambda_{r_{0}+1}\) diverges to infinity, so the ratio is maximized at \(r_{0}\).

Based on the equation above, we can write the R code for the ER method as follows:
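(The original code is not shown on this page; the following is a minimal sketch, using the same T x N data layout as the IC() function above.)

ER <- function(X, rmax = 16) {
  T <- nrow(X)
  N <- ncol(X)
  # eigenvalues of the scaled sample covariance matrix X'X/(NT), in decreasing order
  lam <- eigen(t(X) %*% X / (N * T), symmetric = TRUE, only.values = TRUE)$values
  lam0 <- sum(lam[1:min(N, T)]) / log(min(N, T))           # mock eigenvalue for r = 0
  lam_ext <- c(lam0, lam)
  ratios <- lam_ext[1:(rmax + 1)] / lam_ext[2:(rmax + 2)]  # lambda_r/lambda_{r+1}, r = 0..rmax
  which.max(ratios) - 1                                    # estimated number of factors
}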

Example: using the same DGP for generating strong factors as before, we can apply the ER method to estimate the number of strong factors, as sketched below:
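(A sketch of the calls inside the simulation loop of Section 1, with r_hat_ER and Factor_ER defined analogously to r_hat_IC and Factor_IC; the panels X, X1, X2, X3 are N x T, so they are transposed first.)

r_hat_ER[s, 1] <- ER(t(X),  rmax = 16)   # white noise
r_hat_ER[s, 2] <- ER(t(X1), rmax = 16)   # serially correlated noise
r_hat_ER[s, 3] <- ER(t(X2), rmax = 16)   # cross-sectionally correlated noise
r_hat_ER[s, 4] <- ER(t(X3), rmax = 16)   # gamma-distributed noise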

The results of the ER method for estimating the number of strong factors with different types of error terms are:

       N   T white serial cross gamma
[1,]  50  50     5    4.8   4.8   3.3
[2,] 100  50     5    5.0   5.0   1.6
[3,]  50 100     5    5.0   5.0   1.3
[4,] 100 100     5    5.0   5.0   6.0
[5,]  50 200     5    5.0   5.0   0.4
[6,] 100 200     5    5.0   5.0   6.0

From the results above, we can see that the ER method is quite robust when the error terms in the factor model are highly serially or cross-sectionally correlated. However, it is not robust when the error terms have a non-Gaussian distribution.

Besides, the ER method is designed to estimate the number of factors under the strong factor assumption, so it may fail to detect weak factors in the data. To show this, let’s apply the ER method to the DGP that we used to generate weak factors (the same simulation loop as in Section 1, with the IC call replaced by the ER call):

The goal is to estimate the number of strong and useful weak factors in the DGP. The results of the ER method to estimate the number of weak factors in the DGP are:

       N   T r_hat_ER
[1,]  50  50        1
[2,] 100  50        1
[3,]  50 100        1
[4,] 100 100        1
[5,]  50 200        1
[6,] 100 200        1

From the results above, we can see that the ER method fails to detect the weak factors in the DGP in finite samples, but it precisely separates the strong factor from the weak ones.

3. ED Method

The ED estimator is defined as

\[\hat{r}_{\mathrm{ED}} = \max \left\{r \leq r_{max}: \lambda_{r} - \lambda_{r+1} \geq \delta\right\},\]

where \(\delta\) is a fixed threshold and \(\lambda_{i}\) is the \(i\)-th largest eigenvalue of \(\hat{\Sigma}_{Y}\). This method estimates the number of factors by exploiting the structure of the idiosyncratic terms using results from RMT, and its assumptions explicitly allow serial and cross-sectional correlation in the error terms.

An advantage of this method compared with the IC method is that consistency of the ED estimator allows much weaker factor strength: instead of growing at the rate \(O(N)\), the smallest eigenvalue of \(L'L\) is only required to diverge in probability as \(N \rightarrow \infty\). The algorithm for choosing \(\delta\) and estimating the number of factors is:

  1. Compute the eigenvalues \(\lambda_{1}, \ldots, \lambda_{N}\) of the sample covariance matrix \(X X^{\prime} / T\). Set \(j=r_{\text {max }}+1\).
  2. Compute \(\hat{\beta}\), the slope coefficient in the OLS regression of \(\lambda_{j}, \ldots, \lambda_{j+4}\) on a constant and \((j-1)^{2 / 3}, \ldots,(j+3)^{2/3}\). Set \(\delta=2|\hat{\beta}|\).
  3. Compute \(\hat{r}(\delta) = \max \{i \leq r_{\text{max}} : \lambda_{i}-\lambda_{i+1} \geq \delta \}\), or set \(\hat{r}(\delta)=0\) if \(\lambda_{i}-\lambda_{i+1} < \delta\) for all \(i \leq r_{\text {max }}\).
  4. Set \(j=\hat{r}(\delta)+1\). Repeat steps 2 and 3 until convergence.

Based on this algorithm, we can write the R code for the ED method as:
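(The original code is not shown on this page; the following is a minimal sketch of the algorithm, taking an N x T data matrix as in step 1.)

ED <- function(X, rmax = 16, maxit = 100) {
  N <- nrow(X)
  T <- ncol(X)
  # step 1: eigenvalues of the sample covariance matrix XX'/T, in decreasing order
  lam <- eigen(X %*% t(X) / T, symmetric = TRUE, only.values = TRUE)$values
  j <- rmax + 1
  r_hat <- 0
  for (it in 1:maxit) {
    y <- lam[j:(j + 4)]                             # five consecutive eigenvalues
    x <- ((j - 1):(j + 3))^(2/3)
    delta <- 2 * abs(coef(lm(y ~ x))[2])            # step 2: twice the absolute OLS slope
    idx <- which(lam[1:rmax] - lam[2:(rmax + 1)] >= delta)
    r_hat <- if (length(idx) > 0) max(idx) else 0   # step 3
    if (j == r_hat + 1) break                       # step 4: stop once j is unchanged
    j <- r_hat + 1
  }
  r_hat
}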

Example: using the same DGP for generating strong factors as before, we can apply the ED method to estimate the number of strong factors with different types of error terms, as sketched below:
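(A sketch of the calls inside the simulation loop of Section 1, with r_hat_ED defined analogously; the panels X, X1, X2, X3 are already N x T, as the ED() sketch expects.)

r_hat_ED[s, 1] <- ED(X,  rmax = 16)   # white noise
r_hat_ED[s, 2] <- ED(X1, rmax = 16)   # serially correlated noise
r_hat_ED[s, 3] <- ED(X2, rmax = 16)   # cross-sectionally correlated noise
r_hat_ED[s, 4] <- ED(X3, rmax = 16)   # gamma-distributed noise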

The results of the ED method for estimating the number of strong factors with different types of error terms are:

       N   T white serial cross gamma
[1,]  50  50     5    4.9   4.5   3.5
[2,] 100  50     5    5.0   5.0   6.2
[3,]  50 100     5    5.0   4.9   6.2
[4,] 100 100     5    5.0   5.0   6.1
[5,]  50 200     5    5.0   5.0   6.0
[6,] 100 200     5    5.0   5.0   6.0

From the results above, we can see that the ED method is robust when the error terms in the factor model are highly serially or cross-sectionally correlated. However, it is not robust when the error terms have a non-Gaussian distribution.

An advantage of the ED method compared with the IC and ER methods is that consistency of the ED estimator allows much weaker factor strength. To show this, let’s apply the ED method to the DGP that we used to generate weak factors before (again swapping the estimator call in the weak-factor simulation loop):

The goal is to estimate the number of strong and useful weak factors in the DGP. The results of the ED method to estimate the number of factors in the DGP are:

       N   T r_hat_ED
[1,]  50  50      2.8
[2,] 100  50      3.4
[3,]  50 100      3.1
[4,] 100 100      3.3
[5,]  50 200      3.3
[6,] 100 200      3.8

From the results above, we can see that the ED method can estimate the number of strong and useful weak factors in the DGP very well in finite samples.

4. NE Method

Instead of assuming \(L^{'}L/N \rightarrow \Sigma_{L}\) as for strong factors, it is assumed that \(L^{'}L \rightarrow \Sigma_{L}\) as \(N, T \rightarrow \infty\) for weak factors. Results from random matrix theory (RMT) show that, even in the white noise case \(\Sigma_{e} = \sigma^{2}I_{N}\), PCA estimators of the loadings and factors are inconsistent as \(N, T \rightarrow \infty\).

Specifically, there exists a phase transition phenomenon in the limit: if the \(k\)-th largest eigenvalue of the population covariance matrix \({\Sigma}_{Y}\) is less than the threshold \((\sqrt{N/T}+1) \sigma^{2}\), there is little chance of detecting the \(k\)-th factor using PCA or MLE as \(T, N \rightarrow \infty\). Define the number of detectable factors as \(\# \{ i \leq N : \xi_{i} > (\sqrt{N/T}+1) \sigma^{2} \}\), where \(\xi_{i}\) is the \(i\)-th largest eigenvalue of \({\Sigma}_{Y}\); one goal is then to estimate the number of detectable factors.

The NE method is designed to estimate the number of detectable factors in white noise for high dimensional data using results from RMT, which studies the distribution of the sample eigenvalues. The idea of the NE method is to use the distributional properties of the factor-free \((r=0)\) sample eigenvalues to approximate the distributional properties of the \(N-r\) smallest sample eigenvalues of \(\hat{\Sigma}_{Y}\), assuming the number of factors is \(r\) and \(r \ll N\). The NE estimator is defined as \[ \hat{r}_{\mathrm{NE}} =\underset{r}{\arg \min }\left\{\frac{\beta}{4}\left[\frac{T}{N}\right]^{2} t_{r}^{2}+2(r+1)\right\} \text { for } r \in \mathbb{N}: 0 \leq r<\min (N, T), \] where \(\beta = 1\) for real-valued data (\(\beta = 2\) for complex-valued data) and \[ t_{r} =\left[(N-r) \frac{\sum_{i=r+1}^{N} {\lambda}_{i}^{2}}{\left(\sum_{i=r+1}^{N} {\lambda}_{i}\right)^{2}}-\left(1+\frac{N}{T}\right)\right] N- \left(\frac{2}{\beta} - 1\right)\frac{N}{T}. \]

Based on the equations above, we can write the R code for the NE method as follows:
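(The original code is not shown on this page; the following is a minimal sketch for real-valued data, taking an N x T data matrix.)

NE <- function(X, beta = 1) {
  N <- nrow(X)
  T <- ncol(X)
  # eigenvalues of the sample covariance matrix XX'/T, in decreasing order
  lam <- eigen(X %*% t(X) / T, symmetric = TRUE, only.values = TRUE)$values
  r_seq <- 0:(min(N, T) - 1)
  t_r <- sapply(r_seq, function(r) {
    l <- lam[(r + 1):N]
    ((N - r) * sum(l^2) / sum(l)^2 - (1 + N/T)) * N - (2/beta - 1) * N/T
  })
  crit <- (beta/4) * (T/N)^2 * t_r^2 + 2 * (r_seq + 1)   # AIC-type criterion
  r_seq[which.min(crit)]                                 # estimated number of factors
}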

Example: using the same DGP for generating strong factors as before, we can apply the NE method to estimate the number of strong factors with different types of error terms, as sketched below:
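(A sketch of the calls inside the simulation loop of Section 1, with r_hat_NE defined analogously.)

r_hat_NE[s, 1] <- NE(X)    # white noise
r_hat_NE[s, 2] <- NE(X1)   # serially correlated noise
r_hat_NE[s, 3] <- NE(X2)   # cross-sectionally correlated noise
r_hat_NE[s, 4] <- NE(X3)   # gamma-distributed noise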

The results of the NE method for estimating the number of strong factors with different types of error terms in the DGP are:

       N   T white serial cross gamma
[1,]  50  50     5   13.1  13.3   5.9
[2,] 100  50     5   18.5  17.1   6.2
[3,]  50 100     5   13.5  15.0   6.0
[4,] 100 100     5   22.4  21.9   6.1
[5,]  50 200     5   10.5  17.0   6.0
[6,] 100 200     5   23.5  25.3   6.0

From the results above, we can see that the NE method is not robust when the error terms in the factor model are highly serially or cross-sectionally correlated, or have a non-Gaussian distribution.

The NE method is designed to estimate the number of detectable weak factors in white noise in high dimensional data. To show this, we can apply the NE method to the DGP that we used to generate weak factors before (swapping the estimator call in the weak-factor simulation loop):

The goal is to estimate the number of strong and useful weak factors in the DGP. The results of the NE method to estimate the number of factors in the DGP are:

       N   T r_hat_NE
[1,]  50  50      2.4
[2,] 100  50      2.6
[3,]  50 100      2.1
[4,] 100 100      2.0
[5,]  50 200      2.0
[6,] 100 200      2.0

From the results above, we can see that the NE method can estimate the number of strong and useful weak factors in the DGP quite well in finite samples.

5. BCV Method

Instead of estimating the number of detectable factors, one may prefer to estimate the number of useful factors (the strong plus the useful weak factors). Using the number of useful factors recovers the underlying signal matrix \(X = LF'\) in the factor model more precisely than using the true number of factors or the number of detectable factors. The BCV method is designed to estimate the number of useful factors in heteroscedastic noise for high dimensional data based on bi-cross-validation, using randomly held-out submatrices of the data matrix.

The algorithm for recovering the signal matrix \(X\) has two steps. First, the authors devise an early-stopping alternation (ESA) method to estimate \(X\) given the number of factors \(r^{*}\). Second, they propose a bi-cross-validation (BCV) procedure to estimate the number of factors \(r^{*}\) based on the ESA method. The idea of the BCV method is that, for each candidate \(r\), we first use the three held-in blocks to estimate the held-out block \(X_{00}\) (corresponding to \(Y_{00}\) in the factor model) and then select the optimal \(r^*\) by minimizing the BCV estimate of the prediction error.

Based on the algorithm, we can write the R code for the BCV method as:

EsaBcv <- function (Y, X = NULL, r.limit = 20, niter = 3, nRepeat = 12,
                    only.r = FALSE, svd.method = "fast", center = FALSE){ 
      
    Y <- as.matrix(Y);
    p <- ncol(Y);
    n <- nrow(Y);
    X.ori <- X;
    
    # {center} logical, whether to add an intercept term in the model.
    if (center)         
    X <- cbind(rep(1, n), X);
    qr.X <- NULL;
    Y.ori <- Y;
    
    if(!is.null(X)) {
        X <- as.matrix(X);
        k <- ncol(X);
        if (k >= n)
            stop("Too many predictors!! The number of predictors 
                 is expected to be much less than the sample size!")
        qr.X <- qr(X);
        Y <- qr.qty(qr.X, Y)[-(1:k), ];
        n <- n-k;
    } 
    
    ## decide the held-in size
    gamma <- p/n;
    bar.gamma <- ((sqrt(gamma) + sqrt(1/gamma))/2)^2;
    sqrt.rho <- sqrt(2)/(sqrt(bar.gamma) + sqrt(bar.gamma + 3));
    held.in.size.small <- min(round(sqrt.rho*sqrt(p * n)), p - 1, n - 1);
    held.in.size.large <- round(sqrt.rho^2 * p * n/held.in.size.small);
    if (n < p) {
        n1 <- held.in.size.small;
        p1 <- held.in.size.large;
    } else {
        n1 <- held.in.size.large;
        p1 <- held.in.size.small;
    }
    max.r <- min(n1, p1);
    if(max.r > r.limit)
        max.r <- r.limit;
    if (is.null(nRepeat)) {
       nRepeat <- max(round(p/p1), round(n/n1), round(p/(p-p1)),
                      round(n/(n-n1)));
    } 

    result.list <- NULL;
    for (i in 1:nRepeat) {
        Y.resample <- Y[sample(1:n, n), sample(1:p, p)]; 
        result.list <- rbind(result.list,sapply(0:max.r, function(r)
          BcvGivenRank(Y.resample, r, niter,n1,p1,svd.method)));
    }
    not.na.result <- which(!is.na(colMeans(result.list)));
    max.r <- sum(not.na.result == 1:length(not.na.result)) - 1;
    result.list <- result.list[, 1:(max.r + 1)];
      colnames(result.list) <- 0:max.r;
    best.r <- as.numeric(which.min(colMeans(result.list)) - 1);

    if (only.r)
    return(best.r)

    est <- ESA(Y.ori, best.r, X.ori, center, niter, svd.method);
    result <- list(best.r = best.r, estSigma = est$estSigma, estU = est$estU,
        estD = est$estD, estV = est$estV, beta = est$beta,
        estS = est$estS, mu = est$mu, max.r = max.r); 
      class(result) <- "esabcv"
    return(result);
}

Return a vector of BCV MSEs from the four folds for a given rank:
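(The original helper is not shown on this page; the following is a minimal sketch that matches the call BcvGivenRank(Y.resample, r, niter, n1, p1, svd.method) above. It rotates a different corner of Y into the held-in position for each of the four folds and relies on the BcvOneFold() sketch below.)

BcvGivenRank <- function(Y, r, niter, n1, p1, svd.method = "fast") {
  n <- nrow(Y)
  p <- ncol(Y)
  flips <- expand.grid(row = c(FALSE, TRUE), col = c(FALSE, TRUE))
  sapply(1:4, function(f) {
    rows <- if (flips$row[f]) n:1 else 1:n   # flip rows/columns so that a different
    cols <- if (flips$col[f]) p:1 else 1:p   # corner is held in for each fold
    BcvOneFold(Y[rows, cols], r, niter, n1, p1, svd.method)
  })
}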

Calculate BCV MSE for one fold:
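(A minimal sketch of one Owen-Perry style fold: fit the held-in block by ESA, predict the held-out block through the two off-diagonal blocks, and return the prediction MSE. The published implementation additionally reweights by the estimated noise variances; that step is omitted here for brevity.)

BcvOneFold <- function(Y, r, niter, n1, p1, svd.method = "fast") {
  Y11 <- Y[1:n1, 1:p1, drop = FALSE]           # held-in block, fitted by ESA
  Y10 <- Y[1:n1, -(1:p1), drop = FALSE]        # held-in rows, held-out columns
  Y01 <- Y[-(1:n1), 1:p1, drop = FALSE]        # held-out rows, held-in columns
  Y00 <- Y[-(1:n1), -(1:p1), drop = FALSE]     # held-out block to be predicted
  if (r == 0) return(mean(Y00^2))              # no factors: predict zero
  est <- ESA(Y11, r, niter = niter, svd.method = svd.method)
  if (is.null(est$estS)) return(NA)
  Y00.hat <- Y01 %*% PseudoInv(est$estS, r) %*% Y10
  mean((Y00 - Y00.hat)^2)                      # prediction MSE on the held-out block
}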

Calculate the Moore-Penrose pseudo-inverse of a matrix \(Y\) up to a given rank:
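(A minimal sketch via the truncated SVD.)

PseudoInv <- function(Y, r, tol = 1e-12) {
  s <- svd(Y)
  r <- min(r, sum(s$d > tol * s$d[1]))   # keep at most r nonzero singular values
  if (r == 0) return(matrix(0, ncol(Y), nrow(Y)))
  s$v[, 1:r, drop = FALSE] %*% diag(1/s$d[1:r], r, r) %*% t(s$u[, 1:r, drop = FALSE])
}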

Estimate the latent factor matrix and noise variance using early stopping alternation (ESA) given the number of factors:

ESA <- function(Y, r, X = NULL, center = FALSE, niter = 3, svd.method = "fast"){

    Y <- as.matrix(Y);
    p <- ncol(Y);
    n <- nrow(Y);
    Y.ori <- Y;
    if (center) 
        X <- cbind(rep(1, n), X);
    qr.X <- NULL;
    if(!is.null(X)) {
        X <- as.matrix(X);
        k <- ncol(X);
        if (k >= n)
            stop("Too many predictors!! The number of predictors 
                 is expected to be much less than the sample size!")
        qr.X <- qr(X);
        beta <- qr.coef(qr.X, Y);
        if (center) {
            mu <- beta[1, ];
            if (k == 1) {
                beta1 <- NULL;
            } else
                beta1 <- beta[-1, ];
        } else {
            mu <- NULL;
            beta1 <- beta;
        }
        Y <- qr.qty(qr.X, Y)[-(1:k), ];
        n <- n-k;
    } else {
        beta1 <- NULL;
        mu <- NULL;
    }
    
    # initializing Sigma
    Sigma <- apply(Y, 2, var); 
    if (r == 0)
        return(list(estSigma = Sigma, estU = NULL, estD = NULL, 
                    estV = NULL, estS = NULL, beta = beta1, mu = mu));
    if (r >= min(p, n)) {
        svd.Y <- ComputeSVD(Y, svd.method = svd.method);
        if (!is.null(X)) {
            estU <- qr.qy(qr.X, rbind(matrix(rep(0, k * r), nrow = k), 
                                      svd.Y$u))
        } else
            estU <- svd.Y$u
        return(list(estSigma = rep(0,p), estU = estU, 
                    estD = svd.Y$d / sqrt(n),
                    estV = svd.Y$v, estS = Y.ori, beta = beta1, mu = mu));   
    } 
    iter <- 0;
    while (1) {
        iter = iter + 1;
        if( iter > niter ){ 
            break;
        }
        Y.tilde <- sapply(1:p, function(i)
                            Y[, i] / sqrt(Sigma[i]));
        svd.Ytilde <- ComputeSVD(Y.tilde, rank = r);
        U.tilde <- svd.Ytilde$u;
        V.tilde <- svd.Ytilde$v;
        estU <- U.tilde;
        estD <- diag(svd.Ytilde$d[1:r], r, r)
        estV <- sqrt(Sigma) * V.tilde;     
        res <- Y - estU %*% estD %*% t(estV);
        Sigma <- apply(res, 2, function(x) sum(x^2)/length(x));
    }
    if (!is.null(X)) 
        estU <- rbind(matrix(rep(0, k * r), nrow = k), estU);
    estS <- estU %*% estD %*% t(estV);
    svd.Y <- ComputeSVD(estS, svd.method, r);
    if (!is.null(X)) {
        estU <- qr.qy(qr.X, svd.Y$u);
        estS <- estS + X %*% beta;
    } else
        estU <- svd.Y$u;
    estD <- svd.Y$d[1:r]/sqrt(n);
    return(list(estSigma = Sigma, estU = estU, estD = estD, 
                estV = svd.Y$v, estS = estS, beta = beta1, mu = mu));    
}

A wrapper for computing SVD:
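(The original wrapper is not shown on this page; the sketch below assumes the "fast" option dispatches to a truncated SVD from the irlba package when it is available and the requested rank is small, and otherwise falls back to base R's svd().)

ComputeSVD <- function(Y, svd.method = "fast", rank = NULL) {
  Y <- as.matrix(Y)
  if (is.null(rank)) rank <- min(dim(Y))
  if (svd.method == "fast" && rank < min(dim(Y)) &&
      requireNamespace("irlba", quietly = TRUE)) {
    return(irlba::irlba(Y, nu = rank, nv = rank))  # truncated SVD: d, u, v
  }
  svd(Y, nu = rank, nv = rank)                     # base SVD; d keeps full length
}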

All of the helper functions are defined above, so the only package we need to library() before applying the BCV method is the one backing the fast SVD option (assumed here to be irlba; base R's svd() works as a fallback):
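library(irlba)   # assumed here for the truncated-SVD option; optional with base svd()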

Example: using the same DGP for generating strong factors as before, we can apply the BCV method to estimate the number of strong factors with different types of error terms, as sketched below:
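(A sketch of the calls inside the simulation loop of Section 1, with r_hat_BCV defined analogously; EsaBcv() expects rows to index observations, so the N x T panels are transposed.)

r_hat_BCV[s, 1] <- EsaBcv(t(X),  only.r = TRUE)   # white noise
r_hat_BCV[s, 2] <- EsaBcv(t(X1), only.r = TRUE)   # serially correlated noise
r_hat_BCV[s, 3] <- EsaBcv(t(X2), only.r = TRUE)   # cross-sectionally correlated noise
r_hat_BCV[s, 4] <- EsaBcv(t(X3), only.r = TRUE)   # gamma-distributed noise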

The results of the BCV method for estimating the number of strong factors with different types of error terms in the DGP are:

       N   T white serial cross gamma
[1,]  50  50   5.1   10.5  10.2   5.1
[2,] 100  50   5.0   14.4  12.1   6.0
[3,]  50 100   5.0   11.9  13.8   6.3
[4,] 100 100   5.0   18.2  17.9   6.0
[5,]  50 200   5.0    9.4  16.7   6.2
[6,] 100 200   5.0   19.0  20.0   6.0

From the results above, we can see that the BCV method is not robust when the error terms in the factor model are highly serially or cross-sectionally correlated, or have a non-Gaussian distribution.

The BCV method is designed to estimate the number of strong and useful weak factors in heteroscedastic noise in high dimensional data. To show this, we can apply the BCV method to the DGP that we used to generate weak factors before (swapping the estimator call in the weak-factor simulation loop):

The goal is to estimate the number of strong and useful weak factors in the DGP. The results of the BCV method to estimate the number of factors in the DGP are:

       N   T r_hat_BCV
[1,]  50  50       2.6
[2,] 100  50       2.7
[3,]  50 100       2.8
[4,] 100 100       2.9
[5,]  50 200       2.8
[6,] 100 200       2.9

From the results above, we can see that the BCV method can estimate the number of strong and useful weak factors in the DGP very well in finite samples.

Zhenhao Gong

2020-01-15