Hebbian-Based Principal Component Analysis

This work implements Hebbian-based PCA as a simple feedforward one-layer neural network. The network has m inputs, corresponding to the number of features in the original data, and l linear output neurons, corresponding to the number of components we want to extract.

Load data

I used the iris dataset purely because of the limitations of my personal computer: since we want to recover all the components, the algorithm needs roughly 10^(number of features) epochs to converge. Iris has only 4 features, and even those 10^4 epochs consume a significant amount of CPU time.
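Since the code below refers to a data matrix X and an observation count n_obs, here is a minimal sketch of the loading step assumed throughout (the four numeric iris features, with the species labels dropped):

data(iris)
X = as.matrix(iris[, 1:4])   # Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
n_obs = nrow(X)              # 150 observations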

Preprocess data

The preprocessing step centers the data. It is also reasonable to encode the data on the hypersphere (here, rescaling each feature to the (-1, 1) interval). Either way, the algorithm converges to values very close to the principal components obtained with classical PCA.

normalize <- function(M) {
        # center the data: subtract the column means from every row
        means = apply(M, 2, mean)
        Xnorm = t(apply(M, 1, function(x) { x - means }))
        Xnorm
}

encode <- function(M) {
        # put the data on the hypersphere: rescale every feature to (-1, 1)
        mins = apply(M, 2, min)
        maxs = apply(M, 2, max)
        ranges = maxs - mins
        Xnorm = t(apply(M, 1, function(x) { 2 * (x - mins) / ranges - 1 }))
        Xnorm
}
Xnorm = normalize(X)
# Save a copy of the normalized data set
X_original = Xnorm 

Run PCA

Generalized Hebbian Algorithm

Unlike the classical Oja’s rule, which extracts only one component at a time, the generalized Hebbian algorithm ‘works’ on all the components simultaneously. For every component j with j != 1, the input x used in the update rule is first deflated by the already extracted components: \[\mathbf{x'} = \mathbf{x}-\sum_{k=1}^{j-1} \mathbf{w_k}(n)y_k(n)\]

The whole algorithm can therefore be formulated as follows: \[y_j(n)=\sum_{i=1}^{m}w_{ji}(n)x_i(n)\] \[\Delta w_{ji}(n) = \eta \left(y_j(n)x_i(n)-y_j(n) \sum_{k=1}^j w_{ki}(n) y_k(n)\right)\]
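To make the indices concrete, a single GHA step for one sample x looks like the sketch below; x, W and eta play the same roles as in the training function that follows, and this is just its inner loop written out on its own:

# one GHA step for a single (already centered) sample x of length n_features
y = as.vector(W %*% x)                  # outputs of all output neurons
for (j in 1:nrow(W)) {
        # deflate x by the contributions of components 1..j, then apply the Hebbian update
        x_resid = x - as.vector(y[1:j] %*% W[1:j, , drop = FALSE])
        W[j, ] = W[j, ] + eta * y[j] * x_resid
}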

trainNetwork <- function(Xnorm, n_components, eta) {
        # initialize
        n_features = ncol(Xnorm)
        n_obs = nrow(Xnorm)
        epochs = 10^n_components
        z = epochs * n_obs                # total number of single-sample updates
        W = matrix(data = runif(n_features * n_components, -1, 1),
                   nrow = n_components,
                   ncol = n_features)
        W = normalize(W)                  # center the initial weight matrix
        Y = matrix(data = 0, nrow = n_obs, ncol = n_components)
        # train: one sample per iteration, cycling through the data set
        for (n in 1:z) {
                ind = n %% n_obs + 1
                for (j in 1:n_components) {
                        # neuron output for component j
                        Y[ind, j] = W[j, ] %*% Xnorm[ind, ]
                        # GHA update: Hebbian term minus the deflation by components 1..j
                        W[j, ] = W[j, ] + eta * (Y[ind, j] * Xnorm[ind, ] -
                                 Y[ind, j] * (Y[ind, 1:j] %*% W[1:j, , drop = FALSE]))
                }
        }
        list(data = Y, pc = W)
}

Train Network

Now all we have to do is train our simple network. We will use the learning rate \(\eta = \frac{1}{\text{number of observations}} \approx 0.0066667\). Ideally we would use a decaying schedule of the form \[\eta(n) = \frac{c_1}{n+c_2}\] which satisfies \[\lim_{n \to \infty} \eta(n) = 0\] \[\sum_{n=0}^{\infty} \eta(n) = \infty\] However, finding suitable constants can be quite time consuming, so here we simply keep the learning rate constant, which still gives reasonably good convergence. We will try both versions of data preprocessing: the first experiment uses the centered data and the second uses the data normalized to the (-1, 1) interval.
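For reference, a schedule of that form could look like the sketch below; the constants c1 and c2 are placeholders rather than tuned values:

# decaying learning rate: eta(n) -> 0 while the sum of eta(n) diverges
eta_schedule <- function(n, c1 = 1, c2 = 100) {
        c1 / (n + c2)
}
# inside the training loop one would then use eta_schedule(n) instead of a constant eta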

eta = 1/n_obs
# use centered data
results = trainNetwork(Xnorm,4,eta)
# use normalized data
Xnorm2 = encode(Xnorm)
results2 = trainNetwork(Xnorm2,4,eta)
W = results$pc
W2 = results2$pc
Y = results$data
Y2 = results2$data

And we obtain the following principal components:

W
##            [,1]        [,2]       [,3]        [,4]
## [1,]  0.3868362 -0.03370618  0.8417560  0.37670935
## [2,]  0.6365878  0.71272462 -0.1696528 -0.05774605
## [3,]  0.5986174 -0.54297189 -0.1008112 -0.54043736
## [4,] -0.2621777  0.36708316  0.4533275 -0.70828539
W2
##            [,1]       [,2]        [,3]        [,4]
## [1,] -0.4349662  0.1248902 -0.61242855 -0.64859483
## [2,] -0.4298506 -0.8961971  0.09053084  0.03808553
## [3,]  0.7020452 -0.3530506  0.06045431 -0.61216302
## [4,] -0.3503979  0.2245385  0.78301676 -0.45535006

Reconstruct data

Simple as that:

\[\mathbf{\hat{X}}=\mathbf {Y W}\]

Xhat = Y%*%W
Xhat2 = Y2%*%W2
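Since the rows of W are ordered from the first extracted component to the last, we could also reconstruct from the leading components only; for example, a rank-2 reconstruction (a sketch, not evaluated above):

# reconstruction using only the first two extracted components
Xhat_2pc = Y[, 1:2] %*% W[1:2, ]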

We can now evaluate the difference between the original dataset and the reconstructed one. We’ll use the matrix 2-norm:

error = Xnorm-Xhat
error2 = Xnorm2-Xhat2
norm(error, type="2")
## [1] 1.878206
norm(error2, type="2")
## [1] 0.1473655

It is interesting to note that the normalized version shows better results than the centered one.

Compare to built-in PCA

The last step is to compare our results to the built-in batch PCA. Let’s calculate it for both types of preprocessing:

tPCA1 = t(prcomp(Xnorm)$rotation)
tPCA2 = t(prcomp(Xnorm2)$rotation)
tPCA1
##           [,1]        [,2]        [,3]        [,4]
## PC1  0.3613866 -0.08452251  0.85667061  0.35828920
## PC2 -0.6565888 -0.73016143  0.17337266  0.07548102
## PC3  0.5820299 -0.59791083 -0.07623608 -0.54583143
## PC4  0.3154872 -0.31972310 -0.47983899  0.75365743
tPCA2
##           [,1]       [,2]        [,3]         [,4]
## PC1  0.4249421 -0.1507482  0.61626702  0.645688876
## PC2 -0.4232027 -0.9039671  0.06038308  0.009839255
## PC3  0.7135724 -0.3363160  0.06590030 -0.611034506
## PC4  0.3621300 -0.2168178 -0.78244872  0.457849207

Compare the difference between the obtained components, again using the 2-norm.

norm(tPCA1-W,type="2")
## [1] 1.994868
norm(tPCA2-W2,type="2")
## [1] 2.001197
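These norms look large at first glance because a principal component is only defined up to its sign, and some rows of W are sign-flipped relative to the corresponding prcomp rows. A quick check (a sketch, output not shown) is to align the signs before comparing:

# flip each extracted component so that it points in the same direction
# as the corresponding prcomp component, then compare again
align_signs <- function(W_est, W_ref) {
        signs = sign(rowSums(W_est * W_ref))  # +1/-1 per component
        W_est * signs                         # multiplies row i by signs[i]
}
norm(tPCA1 - align_signs(W, tPCA1), type = "2")
norm(tPCA2 - align_signs(W2, tPCA2), type = "2")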

As we can see, once the sign ambiguity is accounted for, the results are almost identical. The Hebbian approach is still the more fruitful one here, because each update requires fewer operations and it works better in an online setting (we can use a moving average for our stream).

Summary

For small datasets classical PCA, based on the Karhunen–Loève transformation, is better simply because it is faster: GHA takes a long time to converge. For online learning, however, GHA is preferable because it allows us to process a data stream. One improvement to this implementation can be suggested: instead of initializing the weights with random numbers, we could take a small k × k block of the data (where k is the desired number of components), perform classical PCA on it, and then run the online version, which would keep correcting the synaptic weights as the number of examples increases; a sketch of this idea follows.
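As a sketch of that idea, assuming trainNetwork is extended with a hypothetical W_init argument that replaces the random initialization (and using a slightly larger initial chunk than k rows just to keep the batch PCA well determined):

# warm start: batch PCA on a small initial chunk of the centered data
n_components = 4
init_chunk = Xnorm[1:20, ]                            # chunk size is arbitrary in this sketch
W_init = t(prcomp(init_chunk)$rotation)[1:n_components, ]
# hypothetical call; the current trainNetwork would need to accept W_init
# results_ws = trainNetwork(Xnorm, n_components, eta, W_init = W_init)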