Factor analysis summarizes the commonality in high-dimensional data using a small number of latent variables. The empirical challenge is to compute these factors and to select the number of relevant factors. Standard approaches such as PCA use only the covariance matrix. This approach becomes inaccurate in the case of weak factors, namely when the factors have only small explanatory power for the response variables. A solution is to use the information in higher orders, such as the cumulants associated with the coskewness and cokurtosis matrices. This simulation study illustrates how the HFA approach (Lu, Huang and Boudt, 2022, https://ssrn.com/abstract=3599632) performs in weak factor models. The hofa package provides an extensive toolkit for factor analysis based on higher-order multi-cumulants in R. The developer version of the hofa package can be installed with
devtools::install_github("GuanglinHuang/hofa")
The proposed [H]igher-order multi-cumulant [F]actor [A]nalysis (HFA) approach consists of a generalized eigenvalue ratio (GER) test to select the number of non-Gaussian factors, and uses the corresponding eigenvectors to estimate the factor loadings. In contrast with covariance-based approaches, HFA remains reliable for estimating the non-Gaussian factors in weak factor models.
This R Markdown document contains three parts. First, we present the data generating process and show how the factors are weak. Second, we compare the GER estimator with several existing covariance-based approaches for selecting the number of non-Gaussian factors. Third, we compare the HFA estimator with the PCA estimator for estimating the weak non-Gaussian factors.
We use a data generating process similar to those in Bai and Ng (2002), Ahn and Horenstein (2013) and Lu et al. (2022). More details can be found in the documentation of the function hofa.DGP2 of the R package hofa. The DGP takes the following form: \[ x_{it} = \sum_{r=1}^{R}\lambda_{ir}f_{rt} +\sqrt{\theta_i}e_{it}, \quad e_{it} = \sqrt{\frac{1-\rho^2}{1+2J\beta^2}}u^*_{it},\]
\[u^*_{it} = \rho u^*_{it-1}+ u_{it}+\sum\nolimits_{h=\max{(i-J,1)}}^{i-1}\beta u_{ht}+\sum\nolimits_{h=i+1}^{\min{(i+J,N)}}\beta u_{ht},\]
\[f_{jt} = d_{j}f_{jt-1} + v_{jt},\quad v_{jt} \sim i.i.d.\; SGT(0,\sigma_j,\eta_j,p_j,q_j),\]
\[u_{it} \sim i.i.d.\; \mathcal{N}(0,1), \quad \lambda_{ij} \sim i.i.d.\;\mathcal{N}(0,1).\] The parameter \(\rho\) controls the magnitude of the serial correlation. The cross-sectional correlation is governed by two parameters: \(\beta\), which specifies the magnitude of the cross-sectional correlation, and \(J\), which specifies the number of correlated cross-sectional units. The parameter \(\theta_i\) is the variance of each idiosyncratic error. We use the SGT distribution to describe the non-normality of the factors, where \(SGT(\mu,\sigma,\eta,p,q)\) denotes the skewed generalized t (SGT) distribution; see the R package sgt for details.
We set all factors to be non-Gaussian and the number of factors to \(R = 3\). The distribution of the non-Gaussian factors is a skewed Laplace distribution with unit variance (\(\sigma_j = 1, \eta_j=0.8, p_j=1, q_j=\infty\)). We fix the correlation structure of the error terms at \(\rho=0.2, \beta=0.2\), \(J= [N/10]\) and \(\theta_i \sim U[1,\theta]\), and then vary \(\theta\) to move between the strong and weak factor models. In addition, the serial correlation coefficients of the factors are \(d = (0.5,0.2,0)'\) and the \(u_{it}\) are drawn from the standard normal distribution.
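For intuition on the factor innovations, they can be drawn with the sgt package; a minimal sketch, assuming the rsgt parameterization in which lambda plays the role of \(\eta\) above:

# skewed-Laplace factor innovations: SGT with sigma = 1, eta = 0.8, p = 1, q = Inf
# (mean.cent and var.adj default to TRUE, giving zero mean and unit variance)
v = sgt::rsgt(500, mu = 0, sigma = 1, lambda = 0.8, p = 1, q = Inf)
c(mean(v), var(v))  # close to 0 and 1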
When \(\theta\) is small, for example \(\theta = 1\), the scree plot shows a significant gap between the eigenvalues associated with the factors and those associated with the errors; we thus describe this as a strong factor model.
set.seed(1234)
n = 300    # number of cross-sectional units
t = 500    # number of time periods
R = 3      # number of factors
par_f = list(rep(1,R),rep(0.8,R),rep(1,R),rep(Inf,R))  # SGT parameters (sigma, eta, p, q) of the factors
par_e = list(1,0,2,Inf)                                # SGT parameters of the errors (standard normal)
rho_f = c(0.5,0.2,0.0001)                              # serial correlation coefficients of the factors
theta = 1
par_cove = list(beta = 0.2,J = n/10,rho = 0.2,msig_e = c(1,theta))  # error correlation structure
data = hofa::hofa.DGP2(n,t,R,par_f,par_e,par_cove = par_cove,rho_f = rho_f)
X = data$X
m2 = cov(X)                # sample covariance matrix
ev2 = eigen(m2)$values/n   # scaled eigenvalues
plot(ev2[1:20],type = "b",xlab = "number of eigenvalues", ylab = "eigenvalue",
     main = "Scree plot of covariance matrix (theta = 1)")
However, when \(\theta\) increases, the explanatory power of the factors becomes weak. For example, when \(\theta = 7\), the gap in the scree plot of the covariance matrix is negligible: the factors are weak and cannot be identified.
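The data are first regenerated with \(\theta = 7\) before recomputing the scree plots; a minimal sketch of this step, reusing the parameters defined above:

theta = 7
par_cove = list(beta = 0.2,J = n/10,rho = 0.2,msig_e = c(1,theta))
data = hofa::hofa.DGP2(n,t,R,par_f,par_e,par_cove = par_cove,rho_f = rho_f)
X = data$X
m2 = cov(X)
ev2 = eigen(m2)$values/n
plot(ev2[1:20],type = "b",xlab = "number of eigenvalues", ylab = "eigenvalue",
     main = "Scree plot of covariance matrix (theta = 7)")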
In HFA, the eigenvalues of \(\tilde{\mathbf{C}}^{(3)}_{x}\tilde{\mathbf{C}}^{(3)'}_{x}\) still exhibit a significant gap between the third and the fourth eigenvalue when \(\theta = 7\).
m3 = hofa::M3M(X)/t^2               # third-order multi-cumulant matrix
ev3 = sqrt(eigen(m3)$values/n^3)    # scaled eigenvalues
plot(ev3[1:20],type = "b",xlab = "number of eigenvalues", ylab = "eigenvalue",
     main = "Scree plot of third-order multi-cumulant matrix (theta = 7)")
We compare the covariance-based approaches with the GER approach for selecting the number of factors (\(R=3\)). We consider the IC3 and PC3 estimators of Bai and Ng (2002), the ON estimator of Onatski (2010), the ER and GR estimators of Ahn and Horenstein (2013), and the ACT estimator of Fan et al. (2020).
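The full Monte Carlo loop below is not run when knitting; for a single replication on the data generated above, the selection methods can be called directly, for example:

hofa::M2.select(X, method = "ER")$R     # covariance-based eigenvalue ratio
hofa::M3.select(X, method = "GER3")$Rh  # generalized eigenvalue ratio (GER)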
First, we show the results for the strong factor model: we set \(\theta = 1\) with \(i.i.d.\) errors and run 500 replications for each method. “Under-estimated”, “True” and “Over-estimated” indicate the percentages of replications in which the number of common factors is underestimated, correctly estimated and overestimated, respectively. “Average” is the average estimated number of common factors. As shown in the table, all methods perform well.
# (not run)
#
# fn_table = matrix(0,4,7)
# row.names(fn_table) = c("Under-estimated","True","Over-estimated","Average")
# colnames(fn_table) = c("ER","GR","BN-IC3","BN-PC3","ON","ACT","GER")
# rep = 500
# theta = 1
# for (jjj in 1:rep) {
# par_cove = list(beta = 0,J = n/10,rho = 0,msig_e = c(1,theta))
# data = hofa::hofa.DGP2(n,t,R,par_f,par_e,par_cove = par_cove,rho_f = c(0.5,0.2,0.0001))
# X = data$X
#
# fn_ER = hofa::M2.select(X,method = "ER")
# fn_GR = hofa::M2.select(X,method = "GR")
# fn_IC3 = hofa::M2.select(X,method = "BN-IC3")
# fn_PC3 = hofa::M2.select(X,method = "BN-PC3")
# fn_ON = hofa::M2.select(X,method = "ON")
# fn_ACT = hofa::M2.select(X,method = "ACT")
#
# fn_GER = hofa::M3.select(X,method = "GER3")
#
# fn = c(fn_ER$R,
# fn_GR$R,
# fn_IC3$R,
# fn_PC3$R,
# fn_ON$R,
# fn_ACT$R,
# fn_GER$Rh)
#
# fn_table[1,fn < R] = (fn_table[1,fn < R] + 1)
# fn_table[2,fn == R] = (fn_table[2,fn == R] + 1)
# fn_table[3,fn > R] = (fn_table[3,fn > R] + 1)
# fn_table[4,] = fn_table[4,] + fn
#
# fn_table_print = fn_table
# fn_table_print[1:3,] = round(fn_table[1:3,]/jjj*100,1)
# fn_table_print[4,] = round(fn_table[4,]/jjj,1)
# print(fn_table_print)
# }
# ER GR BN-IC3 BN-PC3 ON ACT GER
# Under-estimated 0 0 0 0 0 0 0
# True 100 100 100 100 100 100 100
# Over-estimated 0 0 0 0 0 0 0
# Average 3 3 3 3 3 3 3
For the weak factor model, we set \(\theta = 7\), \(\beta = \rho = 0.2\) and \(J = [N/10]\), again with 500 replications for each method. The results show that the covariance-based approaches lose accuracy, while GER still performs well.
# ER GR BN-IC3 BN-PC3 ON ACT GER
# Under-estimated 29.0 19.2 0 0 46.2 0 0.2
# True 49.4 34.4 0 0 29.2 0 99.8
# Over-estimated 21.6 46.4 100 100 24.6 100 0.0
# Average 3.3 4.8 8 8 3.4 8 3.0
In this section we compare the PCA and HFA approaches for factor estimation. As a measure of goodness-of-fit, we use the Trace-Ratio (TR) to evaluate how close the estimates \(\hat{\Lambda}\) and \(\hat{F}\) are to their true values. Taking \(\hat{F}\) as an example, the TR is defined as \[\text{TR}(\hat{F},F) = \frac{\text{tr}((F'\hat{F})(\hat{F}'\hat{F})^{-1}(\hat{F}'F))}{\text{tr}(F'F)}.\]
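The package provides hofa::TraceRatio for this computation; the hand-rolled version below is a minimal sketch of the formula (trace_ratio is a hypothetical helper, not part of hofa):

# Trace-Ratio of an estimated factor matrix Fhat against the true factors Fmat;
# values close to 1 indicate that the estimated factor space is close to the true one
trace_ratio = function(Fhat, Fmat) {
  num = sum(diag(t(Fmat) %*% Fhat %*% solve(t(Fhat) %*% Fhat) %*% t(Fhat) %*% Fmat))
  num / sum(diag(t(Fmat) %*% Fmat))
}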
First, for the strong factor model, we set \(\theta = 1\) with \(i.i.d.\) errors and run 500 replications for each method. The results show that both PCA and HFA perform well.
# (not run)
# theta = 1
# est_table = matrix(0,2,2)
# row.names(est_table) = c("PCA","HFA")
# colnames(est_table) = c("Factor Loading","Factor")
# rep = 500
#
# for (jjj in 1:rep) {
# par_cove = list(beta = 0,J = n/10,rho = 0,msig_e = c(1,theta))
# data = hofa::hofa.DGP2(n,t,R,par_f,par_e,par_cove = par_cove,rho_f = c(0.5,0.2,0.0001))
# X = data$X
# EE = data$E
# FF = data$FF
# W = data$W
# X = scale(X,scale = FALSE)  # demean the data
#
# ##pca##
# m2 = cov(X)
# ev2 = (eigen(m2)$values)/(n)
#
# u_pca = eigen(m2)$vectors[,1:R]*sqrt(n)
# f_pca = X%*%u_pca/n
#
#
# ##hofa##
# hfa3 = hofa::M3.als(X,rh = R,rg = 0)
# u_hofa = hfa3$u
# f_hofa = hfa3$f
#
# est_table[1,1] <- est_table[1,1] + hofa::TraceRatio(u_pca,W)
# est_table[2,1] <- est_table[2,1] + hofa::TraceRatio(u_hofa,W)
#
# est_table[1,2] <- est_table[1,2] + hofa::TraceRatio(f_pca,FF)
# est_table[2,2] <- est_table[2,2] + hofa::TraceRatio(f_hofa,FF)
#
# est_table_print = est_table
# est_table_print = est_table_print/jjj
#
# print(est_table_print)
# }
# Factor Loading Factor
# PCA 0.9981517 0.9933775
# HFA 0.9959253 0.9933777
For the weak factor model, we set \(\theta = 7\), \(\beta = \rho = 0.2\) and \(J = [N/10]\), again with 500 replications for each method. We report the trace ratios of the estimated factors and the corresponding loadings. The results show that HFA outperforms PCA.
# Factor Loading Factor
# PCA 0.9570056 0.9491562
# HFA 0.9837645 0.9794408
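For a quick check outside the Monte Carlo loop, a single weak-factor replication can be run on the \(\theta = 7\) data generated earlier; a minimal sketch:

FF = data$FF                  # true factors
Xc = scale(X, scale = FALSE)  # demean the panel
f_pca = Xc %*% (eigen(cov(Xc))$vectors[,1:R]*sqrt(n))/n   # PCA factor estimates
hfa3 = hofa::M3.als(Xc, rh = R, rg = 0)                   # HFA factor estimates
c(PCA = hofa::TraceRatio(f_pca, FF), HFA = hofa::TraceRatio(hfa3$f, FF))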