Abstract
We give a majorization algorithm for weighted low-rank matrix approximation, a.k.a. principal component analysis. There is one non-negative weight for each residual. A quadratic programming method is used to compute optimal rank-one weights for the majorization scheme.

Note: This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory gifi.stat.ucla.edu/linlow has a pdf version, the complete Rmd file with all code chunks, the R code, and the bib file.
Both subproblems of this criss-cross regression algorithm are weighted linear least squares problems, with a possibly non-sparse matrix of weights. We go a different route and concentrate on simplifying the weights.
Majorization algorithms are discussed in general in De Leeuw (1994) and Heiser (1995), and more recently and more extensively in Lange (2016) and De Leeuw (2016). We assume the basic majorization approach is familiar and go straight away to the details.
Our first job is to construct a majorization scheme.
Suppose \(A^{(k)}\) and \(B^{(k)}\) are the solutions after \(k\) iterations. Define \[ \hat x_{ij}^{(k)}:=\sum_{s=1}^pa_{is}^{(k)}b_{js}^{(k)}. \] Then \[ \sum_{s=1}^pa_{is}b_{js}=\hat x_{ij}^{(k)}+(\sum_{s=1}^pa_{is}b_{js}-\hat x_{ij}^{(k)}), \] and consequently \[ \sigma(A,B)= \sigma(A^{(k)},B^{(k)})-2\sum_{i=1}^n\sum_{j=1}^mw_{ij}(x_{ij}-\hat x_{ij}^{(k)})(\sum_{s=1}^pa_{is}b_{js}-\hat x_{ij}^{(k)})+ \sum_{i=1}^n\sum_{j=1}^mw_{ij}(\sum_{s=1}^pa_{is}b_{js}-\hat x_{ij}^{(k)})^2. \] If \(c_{ij}\geq w_{ij}\) and \(c_{ij}>0\) we have \[ \sigma(A,B)\leq\sigma(A^{(k)},B^{(k)})-2\sum_{i=1}^n\sum_{j=1}^mw_{ij}(x_{ij}-\hat x_{ij}^{(k)})(\sum_{s=1}^pa_{is}b_{js}-\hat x_{ij}^{(k)})+ \sum_{i=1}^n\sum_{j=1}^mc_{ij}(\sum_{s=1}^pa_{is}b_{js}-\hat x_{ij}^{(k)})^2. \] Define \[ r_{ij}^{(k)}:=\frac{w_{ij}}{c_{ij}}(x_{ij}-\hat x_{ij}^{(k)}), \] and \[ h_{ij}^{(k)}:=\hat x_{ij}^{(k)}+\frac{w_{ij}}{c_{ij}}(x_{ij}-\hat x_{ij}^{(k)}). \] Note that \(h_{ij}^{(k)}\) is a convex combination of \(\hat x_{ij}^{(k)}\) and \(x_{ij}\). Completing the square in the upper bound gives \[ \sigma(A,B)\leq\sigma(A^{(k)},B^{(k)})-\sum_{i=1}^n\sum_{j=1}^mc_{ij}\{r_{ij}^{(k)}\}^2+ \sum_{i=1}^n\sum_{j=1}^mc_{ij}(\sum_{s=1}^pa_{is}b_{js}-h_{ij}^{(k)})^2. \] In the majorization step we minimize \[ \eta(A,B,A^{(k)},B^{(k)}):=\sum_{i=1}^n\sum_{j=1}^mc_{ij}(\sum_{s=1}^pa_{is}b_{js}-h_{ij}^{(k)})^2, \] which is a matrix approximation problem similar to our original problem, with adjusted weights and an adjusted target. In the next iteration we define a new target, solve the matrix approximation problem again, and so on.
It seems that our developments so far have not gained us much. Instead of one matrix approximation problem we now have a sequence of them, all of the same structure. The gain will have to come from the choice of the \(c_{ij}\), which we will try to construct in such a way that the matrix approximation problems in each iteration become simpler.
First suppose \(c_{ij}>0\) and there are \(u>0\) and \(v>0\) such that \(c_{ij}=u_iv_j\). Define \[ g_{ij}^{(k)}:=\sqrt{u_iv_j}\ h_{ij}^{(k)}, \] and new variables \(\alpha_{is}:=\sqrt{u_i}a_{is}\) and \(\beta_{js}:=\sqrt{v_j}b_{js}\). Then \[ \sum_{i=1}^n\sum_{j=1}^mc_{ij}(\sum_{s=1}^pa_{is}b_{js}-h_{ij}^{(k)})^2=\sum_{i=1}^n\sum_{j=1}^m(\sum_{s=1}^p\alpha_{is}\beta_{js}-g_{ij}^{(k)})^2. \] This shows that the matrix approximation problems that must be solved in each majorization step are unweighted, which means they can be solved rapidly using a partial singular value decomposition.
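To make the rescaling concrete, here is a small numerical check of the identity above on random data. This is only an illustration; the names in this sketch are ad hoc and it is not part of the code in the appendix.

set.seed (1)
n <- 5; m <- 4; p <- 2
u <- runif (n, 1, 2); v <- runif (m, 1, 2)   # c_{ij} = u_i v_j
h <- matrix (rnorm (n * m), n, m)            # current target
a <- matrix (rnorm (n * p), n, p)
b <- matrix (rnorm (m * p), m, p)
g <- sqrt (outer (u, v)) * h                 # g_{ij} = sqrt(u_i v_j) h_{ij}
alpha <- sqrt (u) * a                        # alpha_{is} = sqrt(u_i) a_{is}
beta <- sqrt (v) * b                         # beta_{js} = sqrt(v_j) b_{js}
lhs <- sum (outer (u, v) * (tcrossprod (a, b) - h) ^ 2)
rhs <- sum ((tcrossprod (alpha, beta) - g) ^ 2)
all.equal (lhs, rhs)                         # TRUE: the two problems coincide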
If \(X\) is symmetric our loss function becomes \[ \sigma(A)=\sum_{i=1}^n\sum_{j=1}^n w_{ij}(x_{ij}-\sum_{s=1}^p a_{is}a_{js})^2. \] Now choose \(c_{ij}=u_iu_j\geq w_{ij}\), and basically the same majorization scheme can be applied. The matrix approximation problem in each iteration is now solved by using the first \(p\) eigenvalues of the current target (De Leeuw (1974)).
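As a sketch of the symmetric-case update (not the appendix code), the rank-\(p\) approximation of the current symmetric target could be computed as follows, truncating negative eigenvalues at zero to keep the approximation positive semidefinite as in De Leeuw (1974); the function name is ad hoc.

sym_lfunc <- function (h, p) {
  # best rank-p positive semidefinite approximation of a symmetric target h
  e <- eigen (h, symmetric = TRUE)
  k <- as.matrix (e$vectors[, 1:p])
  k %*% diag (pmax (e$values[1:p], 0), p) %*% t (k)
}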
From an historical point of view the most important special case is that in which \(w_{ij}\) is either one or zero, with zero indicating missing data. It was analyzed by Thomson (1934) in the case of factor analysis, using basically the same alternating least squares algorithm used by Yates (1933) in factorial experiments.
We derive this algorithm from our majorization perspective by choosing \(c_{ij}=1\), regardless of whether \(w_{ij}=1\) or \(w_{ij}=0\). Thus \[ r_{ij}^{(k)}=w_{ij}(x_{ij}-\hat x_{ij}^{(k)})=\begin{cases}0&\text{ if }w_{ij}=0,\\x_{ij}-\hat x_{ij}^{(k)}&\text{ if }w_{ij}=1,\end{cases} \] and \[ h_{ij}^{(k)}=\hat x_{ij}^{(k)}+r_{ij}^{(k)}=\begin{cases}\hat x_{ij}^{(k)}&\text{ if }w_{ij}=0,\\x_{ij}&\text{ if }w_{ij}=1.\end{cases} \] This is exactly the Thomson-Yates method for dealing with data that are missing, either by design, by accident, or structurally because of properties of the model. The equivalence of the alternating least squares and majorization methods in this case was already observed by Kiers (1997).
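In code, the Thomson-Yates target is simply imputation of the current fit at the missing cells. A minimal sketch, assuming a 0/1 weight matrix w; the function name is ad hoc and not from the appendix.

impute_target <- function (x, xhat, w) {
  # h = x where observed (w == 1), current fit xhat where missing (w == 0)
  ifelse (w == 1, x, xhat)
}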
Kiers (1997), following some suggestions by Heiser (1995), proposes to use \(c_{ij}=\max_{i=1}^n\max_{j=1}^m w_{ij}\). Thus all elements \(c_{ij}\) are the same, say equal to \(c\), and \[ h_{ij}^{(k)}=\hat x_{ij}^{(k)}+\frac{w_{ij}}{c}(x_{ij}-\hat x_{ij}^{(k)}). \] This is improved by Groenen, Giaquinto, and Kiers (2003), who use \(c_{ij}=\max_{j=1}^m w_{ij}=:c_i\), which means that we can write \[ h_{ij}^{(k)}=\hat x_{ij}^{(k)}+\frac{w_{ij}}{c_i}(x_{ij}-\hat x_{ij}^{(k)}). \] Both Kiers (1997) and Groenen, Giaquinto, and Kiers (2003) compare their majorization algorithms with the alternating least squares method of Gabriel and Zamir (1979). The conclusion seems to be that the algorithm of Kiers (1997), which uses a scalar bound for the \(w_{ij}\), is slower than criss-cross regression, while the algorithm of Groenen, Giaquinto, and Kiers (2003), which uses a row-wise bound, is actually faster than criss-cross regression.
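For reference, the two bounds just discussed, together with their column-wise analogue, can be written as \((u,v)\) pairs with \(c_{ij}=u_iv_j\). This short sketch mirrors the bnd options of the wPCA() function in the appendix; the function names here are ad hoc.

# Kiers (1997): overall maximum
bound_all <- function (w) list (u = rep (1, nrow (w)), v = rep (max (w), ncol (w)))
# Groenen, Giaquinto, and Kiers (2003): row-wise maxima
bound_row <- function (w) list (u = apply (w, 1, max), v = rep (1, ncol (w)))
# column-wise maxima
bound_col <- function (w) list (u = rep (1, nrow (w)), v = apply (w, 2, max))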
It is clear that both Kiers (1997) and Groenen, Giaquinto, and Kiers (2003) use \(c_{ij}\) of the form \(c_{ij}=u_iv_j\) for some simple specific choices of \(u\) and \(v\). We will now look at some general theory for weight majorizations of this form, and see if we can come up with a possible improvement.
The algorithmic map \(\mathcal{A}:\mathbb{R}^{n\times m}\rightarrow\mathbb{R}^{n\times m}\) in the case that \(C=uv'\) is defined as the composition of three maps. The first is the target map \(\mathcal{G}\), which maps the current \(Z\) to the scaled target with elements \[ \mathcal{G}(Z)_{ij}:=\sqrt{u_iv_j}\left\{z_{ij}+\frac{w_{ij}}{c_{ij}}(x_{ij}-z_{ij})\right\}. \] The second is the map \(\Gamma_p\), which computes the best unweighted rank-\(p\) approximation of its argument. The third is the unscaling map \(\mathcal{S}\), with elements \[ \mathcal{S}(Z)_{ij}:=\frac{z_{ij}}{\sqrt{u_iv_j}}. \]
Combining these three maps, all from \(\mathbb{R}^{n\times m}\) to \(\mathbb{R}^{n\times m}\), gives \[ \mathcal{A}(Z)=\mathcal{S}(\Gamma_p(\mathcal{G}(Z))). \]
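In terms of the appendix code, where gfunc(), lfunc(), and sfunc() implement \(\mathcal{G}\), \(\Gamma_p\), and \(\mathcal{S}\), the composition can be written as below; this is what ffunc() in the appendix computes, up to vectorization, and the name amap is ad hoc.

amap <- function (z, x, p, u, v, w) {
  # A(Z) = S(Gamma_p(G(Z)))
  sfunc (lfunc (gfunc (z, x, u, v, w), p), u, v)
}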
The asymptotic error constant of the iterative majorization procedure is the spectral radius of the derivative of the algorithmic map \(\mathcal{A}\). By the chain rule \[ \mathcal{A}(Z+\Delta)=\mathcal{A}(Z)+\mathcal{S}(\mathcal{D}\Gamma_p(\mathcal{G}(Z))\mathcal{H}(\Delta))+o(\|\Delta\|), \] where \(\mathcal{H}(\Delta)=U^\frac12(E\star\Delta)V^\frac12\) and \(E\) is the matrix with elements \(1-w_{ij}/c_{ij}\).
The derivatives of \(\Gamma_p\) in the rectangular case are computed in De Leeuw (2008). From that paper, using the representation \(\Gamma_p(\mathcal{G}(Z))=\mathcal{G}(Z)L_p^{\ }L_p'\), with \(L_p\) the normalized eigenvectors corresponding with the \(p\) largest eigenvalues of \(\mathcal{G}(Z)'\mathcal{G}(Z)\), \[ \mathcal{D}\Gamma_p(\mathcal{G}(Z))\mathcal{H}(\Delta)=\mathcal{H}(\Delta)L_p^{}L_p'-\mathcal{G}(Z)\{\Xi_p^{}(Z,\Delta)+\Xi_p'(Z,\Delta)\}, \] where \[ \Xi_p(Z,\Delta)=\sum_{s=1}^p (\mathcal{G}(Z)'\mathcal{G}(Z)-\lambda_s^2I)^+(\mathcal{G}(Z)'\mathcal{H}(\Delta)+\mathcal{H}(\Delta)'\mathcal{G}(Z))l_s^{}l_s'. \]
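The appendix functions lrPerturb() and lrDerive() implement these formulas. A rough way to check them against a central finite difference of the algorithmic map could look as follows; this is only a sketch, and the step size, the random direction, and the function name are ad hoc.

check_deriv <- function (z, x, p, u, v, w, eps = 1e-6) {
  # compare lrPerturb() with a central difference of cfunc() in a random direction
  delta <- matrix (rnorm (length (z)), nrow (z), ncol (z))
  numer <- (cfunc (as.vector (z + eps * delta), x, p, u, v, w) -
            cfunc (as.vector (z - eps * delta), x, p, u, v, w)) / (2 * eps)
  analy <- as.vector (lrPerturb (z, delta, x, p, u, v, w))
  max (abs (numer - analy))
}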
As a consequence of our previous discussions it makes sense to look for \(u\geq 0\) and \(v\geq 0\) such that \(u_iv_j\geq w_{ij}\), but, given the constraints, the matrix with elements \(u_iv_j\) is as small as possible. There are many ways in which we can formalize “as small as possible”. In De Leeuw (2017) convex analysis and duality were used in a more general context. In this paper we use the notion of approximation from above, i.e. we minimize some norm \(\|W-uv'\|\) over \(uv'\geq W\) and \(u,v\geq 0\).
In order to make this a convex problem we take logarithms. Thus we minimize \(\|\log W -(\log u+\log v)\|\) over \(\log u_i + \log v_j\geq\log w_{ij}\) for all \(i\) and \(j\). One-sided discrete linear approximation for general \(\ell_p\) norms is discussed in Watson (1973). We use least squares on the logarithms, which leads to a quadratic programming problem. But \(\ell_1\) and \(\ell_\infty\) are interesting as well, and lead to linear programming problems.
Using logarithms only makes sense if \(w_{ij}>0\). For \(w_{ij}=0\) there is an obvious way to proceed, by just skipping those elements. The \(u_i\) and \(v_j\) computed over the non-zero \(w_{ij}\) will automatically satisfy \(u_iv_j>w_{ij}=0\) for the zero weights as well.
Suppose \(\mathcal{M}\) is the set of pairs \((i,j)\) such that \(w_{ij}>0\). To minimize \[
\mathop{\sum\sum}_{(i,j)\in\mathcal{M}}\ (\log w_{ij}-\log u_i-\log v_j)^2
\] over \(\log u_i + \log v_j\geq\log w_{ij}\) we use the lsi() function from the lsei package (Wang, Lawson, and Hanson (2015)).
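A minimal usage sketch, assuming the lsei package is installed, mbound() from the appendix has been sourced, and w is a matrix of strictly positive weights.

library (lsei)
uv <- mbound (w)
max (w - outer (uv$u, uv$v))   # should be (numerically) non-positive: u_i v_j >= w_ij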
The data are crash data on New Zealand roads, from the VGAM package (Yee (2010)). Data are cross-classified by time (hour of the day) and day of the week, accumulated over 2009. We use crashi, which gives the numbers of injuries by car.
## Mon Tue Wed Thu Fri Sat Sun
## 0 16 10 22 12 29 55 55
## 1 13 11 15 23 23 42 64
## 2 5 8 16 13 24 37 64
## 3 6 4 8 12 19 31 45
## 4 7 6 11 16 11 35 35
## 5 12 14 14 18 19 27 35
## 6 37 37 32 45 32 21 36
## 7 66 79 92 75 73 40 33
## 8 117 138 132 138 122 59 36
## 9 67 81 68 75 72 59 45
## 10 67 70 62 76 72 84 57
## 11 80 80 50 80 74 114 86
## 12 75 85 86 87 94 101 93
## 13 77 69 84 90 90 105 80
## 14 84 87 98 85 104 103 96
## 15 112 136 134 156 158 120 103
## 16 115 110 138 144 146 106 90
## 17 127 130 140 149 155 104 97
## 18 63 69 91 97 142 83 64
## 19 47 63 53 57 67 69 52
## 20 25 46 62 55 68 70 44
## 21 34 42 49 53 85 62 33
## 22 24 26 35 52 67 54 18
## 23 28 23 20 49 61 69 29
We pretend the data are independent Poisson counts, and we choose weights equal to \(x_{ij}^{-1}\). The optimal rank-one approximation from above is computed with the program mbound() in the appendix. The optimal \(u\) and \(v\) computed by mbound() are plotted in figure 2.
The wPCA() function has a bnd parameter, which can be equal to all, col, row, or opt. If bnd="all" then \(c_{ij}=\max w_{ij}\), as in Kiers (1997), and if bnd="row" then \(c_{ij}=\max_j w_{ij}\), as in Groenen, Giaquinto, and Kiers (2003). If bnd="col" then \(c_{ij}=\max_i w_{ij}\), and if bnd="opt" then we compute the optimal \(u_iv_j\).
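The computations reported below could be run along the following lines; this is a sketch, assuming the injury counts are in a matrix crashi as printed above and the functions from the appendix have been sourced.

w <- 1 / crashi                          # Poisson-motivated weights x_ij^(-1)
fit_opt <- wPCA (crashi, w, p = 1, bnd = "opt")
fit_row <- wPCA (crashi, w, p = 1, bnd = "row")
# f is sum(w * (x - z)^2), the weighted loss reported as chi-square in the text
c (fit_opt$f, fit_opt$itel, fit_row$f, fit_row$itel)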
First look at \(p=1\). For all four values of the bnd parameter we find a chi-square equal to 709.9526292976 with \(168-((24+7)-1)=138\) degrees of freedom. The iteration count for bnd="all","col","row","opt" is, respectively, 208, 151, 21, 17. Because in this example the between-row/within-column variation is so much larger than the within-row/between-column variation, it makes sense that bnd="row" performs well, almost as well as bnd="opt", while bnd="col" performs badly, almost as badly as bnd="all". The one-dimensional plots of \(A\) and \(B\) from \(z=ab'\) are in figure 3.
We also computed the spectral norms of the derivative at the solution, using the functions in lowrank.R. For bnd="all","col","row","opt" the convergence rate is, respectively, 0.9710924907, 0.955475149, 0.660381091, 0.6152936489.
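One way to compute such rates with the appendix code is to form the Jacobian of the algorithmic map at the solution with lrDerive() and take the modulus of its largest eigenvalue, i.e. the spectral radius mentioned earlier. A sketch, reusing w and fit_opt from the previous sketch:

dz <- lrDerive (fit_opt$z, crashi, 1, fit_opt$u, fit_opt$v, w)
max (Mod (eigen (dz, only.values = TRUE)$values))   # estimated asymptotic rate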
Next look at \(p=2\). For all four values of the bnd parameter we find a chi-square equal to 215.349822881 with \(168-((24+7)*2-4)=110\) degrees of freedom. The iteration count for bnd="all","col","row","opt" is, respectively, 164, 99, 46, 35.
For bnd="all","col","row","opt" the convergence rate is, respectively, 0.9715807406, 0.961724624, 0.9042846128, 0.8856193743.
The main question, which we have not answered, is whether the decrease in computing time from the lower number of iterations compensates for the extra work of computing the optimal rank-one bounds. Computing the optimal bounds is a quadratic programming problem of size \(nm\), which can be substantial and rapidly becomes impractical for really large data sets. In that case alternative strategies are needed, for example starting with row or column bounds and performing just a few iterations of an iterative quadratic programming method. We have also not exploited the \(\ell_1\) and \(\ell_\infty\) formulations, which lead to linear programming problems that are less expensive than the quadratic one.
Also note that in each majorization step we have solved a low-rank approximation problem exactly. We would also have a convergent algorithm if we replaced the singular value decomposition in each majorization step by one or more alternating least squares iterations. This will generally lead to more majorization iterations, but each majorization iteration becomes less expensive. In addition we can use the \(A\) and \(B\) from \(X\approx AB'\) of the previous majorization iteration as starting points for the next one.
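A sketch of one such inner alternating least squares sweep on the unweighted target \(g\), which could replace the eigendecomposition in lfunc(); a and b are the factors carried over from the previous majorization iteration, and the function name is ad hoc.

als_sweep <- function (g, a, b) {
  # regress the columns of g on a, then the rows of g on the updated b
  b <- crossprod (g, a) %*% solve (crossprod (a))
  a <- g %*% b %*% solve (crossprod (b))
  list (a = a, b = b, z = tcrossprod (a, b))
}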
There is an interesting recent paper by Szlam, Tulloch, and Tygert (2017), who show that just a few alternating least squares iterations will generally provide a very good low-rank approximation. This implies that it may not pay off to optimize the number of iterations needed for convergence, if convergence is not really of much practical value anyway. It also indicates that performing only a small number of “inner” alternating least squares iterations in each majorization step is probably a good strategy. Again, that is something to be explored.
mbound <- function (w) {
  # Compute the optimal rank-one bound c = uv' >= w by least squares on the
  # logarithms: minimize sum ((log w_ij - log u_i - log v_j)^2) subject to
  # log u_i + log v_j >= log w_ij, a quadratic program solved with lsi()
  # from the lsei package. Assumes all w_ij > 0.
  n <- nrow (w)
  m <- ncol (w)
  # design matrix: one indicator column for each log u_i and each log v_j,
  # one row per cell (i, j), in the column-major order of as.vector(w)
  g1 <- matrix (0, n * m, n)
  g2 <- matrix (0, n * m, m)
  k <- 1
  for (j in 1:m) {
    for (i in 1:n) {
      g1[k, i] <- 1
      g2[k, j] <- 1
      k <- k + 1
    }
  }
  g <- cbind (g1, g2)
  f <- log (as.vector (w))
  # lsi (a, b, e, f) minimizes ||a x - b|| subject to e x >= f
  h <- lsi (g, f, g, f)
  u <- exp (h[1:n])
  v <- exp (h[n + (1:m)])
  return (list (u = u, v = v))
}
gfunc <- function (z, x, u, v, w) {
  # target map G: form h = z + (w / c) * (x - z) with c = uv', then scale
  # by sqrt(u_i v_j) so that the rank-p subproblem becomes unweighted
  return ((z + (w / outer (u, v)) * (x - z)) * sqrt (outer (u, v)))
}
sfunc <- function (z, u, v) {
  # unscaling map S: undo the sqrt(u_i v_j) scaling
  return (z / sqrt (outer (u, v)))
}
lfunc <- function (x, p) {
  # best unweighted rank-p approximation x L_p L_p', with L_p the eigenvectors
  # of the p largest eigenvalues of x'x
  l <- as.matrix (eigen (crossprod (x))$vectors[, 1:p])
  return (x %*% tcrossprod (l))
}
wPCA <- function (x,
                  w,
                  p = 2,
                  bnd = "opt",
                  itmax = 1000,
                  eps = 1e-6,
                  verbose = FALSE) {
  n <- nrow (x)
  m <- ncol (x)
  # choose the rank-one bound c = uv' >= w
  if (bnd == "all") {
    # overall maximum, as in Kiers (1997)
    u <- rep (1, n)
    v <- rep (max (w), m)
  }
  if (bnd == "row") {
    # row-wise maxima, as in Groenen, Giaquinto, and Kiers (2003)
    u <- apply (w, 1, max)
    v <- rep (1, m)
  }
  if (bnd == "col") {
    # column-wise maxima
    u <- rep (1, n)
    v <- apply (w, 2, max)
  }
  if (bnd == "opt") {
    # optimal rank-one bound from the quadratic program
    uv <- mbound (w)
    u <- uv$u
    v <- uv$v
  }
  # start from the unweighted rank-p approximation of x
  zold <- lfunc (x, p)
  fold <- sum (w * (x - zold) ^ 2)
  itel <- 1
  repeat {
    # one majorization step: scaled target, rank-p approximation, unscaling
    z1 <- gfunc (zold, x, u, v, w)
    z2 <- lfunc (z1, p)
    znew <- sfunc (z2, u, v)
    fnew <- sum (w * (x - znew) ^ 2)
    if (verbose) {
      cat (
        "iteration ", formatC (itel, digits = 0, width = 3, format = "d"),
        " fold", formatC (fold, digits = 6, width = 10, format = "f"),
        " fnew", formatC (fnew, digits = 6, width = 10, format = "f"),
        "\n"
      )
    }
    # stop when the decrease in loss is below eps or itmax is reached
    if (((fold - fnew) < eps) || (itel == itmax))
      break
    itel <- itel + 1
    zold <- znew
    fold <- fnew
  }
  return (list (z = znew, f = fnew, itel = itel, u = u, v = v))
}
gfunc <- function (z, x, u, v, w) {
return ((z + (w / outer (u, v)) * (x - z)) * sqrt(outer(u,v)))
}
sfunc <- function (z, u, v) {
return (z / sqrt (outer (u, v)))
}
lfunc <- function (x, p) {
l <- as.matrix(eigen (crossprod(x))$vectors[,1:p])
yy <- x %*% tcrossprod (l)
return (yy)
}
ffunc <- function (z, x, p, u, v, w) {
  # one complete step of the algorithmic map: target, rank-p approximation,
  # unscaling; returned as a vector
  z1 <- gfunc (z, x, u, v, w)
  z2 <- lfunc (z1, p)
  z3 <- sfunc (z2, u, v)
  return (as.vector (z3))
}
cfunc <- function (z, x, p, u, v, w) {
  # vectorized version of ffunc: z enters and leaves as a vector
  n <- length (u)
  m <- length (v)
  z <- matrix (z, n, m)
  return (ffunc (z, x, p, u, v, w))
}
hfunc <- function (delta, u, v, w) {
  # the map H(delta) = U^(1/2) (E * delta) V^(1/2), with E = 1 - w/c
  ((1 - w / outer (u, v)) * delta) * sqrt (outer (u, v))
}
lrPerturb <- function (z, delta, x, p, u, v, w) {
  # directional derivative of the algorithmic map at z in the direction delta,
  # using the derivative of the rank-p approximation from De Leeuw (2008)
  m <- ncol (x)
  gz <- gfunc (z, x, u, v, w)
  hd <- hfunc (delta, u, v, w)
  gh <- crossprod (gz, hd) + crossprod (hd, gz)
  hp <- matrix (0, m, m)
  e <- eigen (crossprod (gz))
  lt <- e$vectors
  vt <- e$values
  lp <- as.matrix (lt[, 1:p])
  for (s in 1:p) {
    # Moore-Penrose inverse of (G'G - lambda_s^2 I)
    vi <- ifelse (vt == vt[s], 0, 1 / (vt - vt[s]))
    cinv <- lt %*% diag (vi) %*% t (lt)
    hp <- hp + cinv %*% gh %*% outer (lp[, s], lp[, s])
  }
  dz <- hd %*% tcrossprod (lp) - gz %*% (hp + t (hp))
  return (sfunc (dz, u, v))
}
lrDerive <- function (z, x, p, u, v, w) {
  # full (nm x nm) Jacobian of the algorithmic map, built column by column
  # by perturbing one cell of z at a time
  n <- nrow (z)
  m <- ncol (z)
  k <- 1
  dz <- matrix (0, n * m, n * m)
  for (j in 1:m)
    for (i in 1:n) {
      dx <- matrix (0, n, m)
      dx[i, j] <- 1
      dz[, k] <- as.vector (lrPerturb (z, dx, x, p, u, v, w))
      k <- k + 1
    }
  return (dz)
}
De Leeuw, J. 1974. “Approximation of a Real Symmetric Matrix by a Positive Semidefinite Matrix of Rank R.” Technical Memorandum TM-74-1229-10. Murray Hill, N.J.: Bell Telephone Laboratories. http://www.stat.ucla.edu/~deleeuw/janspubs/1974/reports/deleeuw_R_74c.pdf.
———. 1984. “Fixed-Rank Approximation with Singular Weight Matrices.” Computational Statistics Quarterly 1: 3–12. http://www.stat.ucla.edu/~deleeuw/janspubs/1984/articles/deleeuw_A_84b.pdf.
———. 1994. “Block Relaxation Algorithms in Statistics.” In Information Systems and Data Analysis, edited by H.H. Bock, W. Lenski, and M.M. Richter, 308–24. Berlin: Springer Verlag. http://www.stat.ucla.edu/~deleeuw/janspubs/1994/chapters/deleeuw_C_94c.pdf.
———. 2008. “Derivatives of Fixed-Rank Approximations.” Preprint Series 547. Los Angeles, CA: UCLA Department of Statistics. http://www.stat.ucla.edu/~deleeuw/janspubs/2008/reports/deleeuw_R_08b.pdf.
———. 2016. Block Relaxation Methods in Statistics. Bookdown. https://bookdown.org/jandeleeuw6/bras/.
———. 2017. “Some Majorization Theory for Weighted Least Squares.” doi:10.13140/RG.2.2.22512.25601.
Eckart, C., and G. Young. 1936. “The Approximation of One Matrix by Another of Lower Rank.” Psychometrika 1 (3): 211–18.
Gabriel, K.R., and S. Zamir. 1979. “Lower Rank Approximation of Matrices by Least Squares with Any Choice of Weights.” Technometrics 21 (4): 489–98.
Groenen, P.J.F., P. Giaquinto, and H.A.L. Kiers. 2003. “Weighted Majorization Algorithms for Weighted Least Squares Decomposition Models.” EI 2003-09. Rotterdam, Netherlands: Econometric Institute, Erasmus University.
Heiser, W.J. 1995. “Convergent Computing by Iterative Majorization: Theory and Applications in Multidimensional Data Analysis.” In Recent Advances in Descriptive Multivariate Analysis, edited by W.J. Krzanowski. Oxford, England: Clarendon Press.
Keller, J.B. 1962. “Factorization of Matrices by Least Squares.” Biometrika 49: 239–42.
Kiers, H.A.L. 1997. “Weighted Least Squares Fitting Using Iterative Ordinary Least Squares Algorithms.” Psychometrika 62: 251–66.
Lange, K. 2016. MM Optimization Algorithms. SIAM.
Szlam, A., A. Tulloch, and M. Tygert. 2017. “Accurate Low-Rank Approximations via a Few Iterations of Alternating Least Squares.” SIAM Journal on Matrix Analysis and Applications 38 (2): 425–33.
Thomson, G.H. 1934. “Hotelling’s Method Modified to Give Spearman’s g.” Journal of Educational Psychology 25: 366–74.
Wang, Y., C.L. Lawson, and R.J. Hanson. 2015. lsei: Solving Least Squares Problems under Equality/Inequality Constraints. http://CRAN.R-project.org/package=lsei.
Watson, G.A. 1973. “The Calculation of Best Linear One-Sided L_p Approximations.” Mathematics of Computation 27 (123): 607–20.
Yates, F. 1933. “The Analysis of Replicated Experiments when the Field Results are Incomplete.” Empire Journal of Experimental Agriculture 1: 129–42.
Yee, T.W. 2010. “The VGAM Package for Categorical Data Analysis.” Journal of Statistical Software 32 (10): 1–34.