This work implements Hebbian-based PCA as a simple feedforward one-layer neural network. The network has m inputs, corresponding to the number of features in the original data, and l linear output neurons, corresponding to the number of components we want to extract.
I chose the iris dataset because it is small, which matters given the limitations of my personal computer. Since we want to recover all the components, the algorithm needs on the order of 10^(number of components) epochs to converge; even with only 4 dimensions, iris already consumes a significant amount of CPU time (10^4 epochs).
The preprocessing step consists of centering the data. It is also reasonable to additionally encode the data on the hypersphere; in this implementation that means rescaling each feature to the (-1, 1) interval. Either way, the algorithm converges to values very close to the principal components obtained using classical PCA.
normalize <- function(M) {
  # center the data: subtract each column's mean
  means = apply(M, 2, mean)
  Xnorm = t(apply(M, 1, function(x) { x - means }))
  Xnorm
}
encode <- function(M) {
  # put the data on the hypersphere: rescale each feature to the (-1, 1) interval
  mins   = apply(M, 2, min)
  maxs   = apply(M, 2, max)
  ranges = maxs - mins
  Xnorm  = t(apply(M, 1, function(x) { 2 * (x - mins) / ranges - 1 }))
  Xnorm
}
# the first four columns of iris hold the numeric features
X     = as.matrix(iris[, 1:4])
n_obs = nrow(X)
Xnorm = normalize(X)
# Save a copy of the normalized data set
X_original = Xnorm
Unlike classical Oja’s rule, which extracts only one component at a time, the generalized Hebbian algorithm (GHA) ‘works’ on all the components simultaneously. Essentially, for every component j with j != 1, the input x used in the update rule is deflated as follows: \[\mathbf{x'} = \mathbf{x}-\sum_{k=1}^{j-1} \mathbf{w_k}(n)y_k(n)\]
The whole algorithm can therefore be formulated as follows: \[y_j(n)=\sum_{i=1}^{m}w_{ji}(n)x_i(n)\] \[\Delta w_{ji}(n) = \eta \left( y_j(n)x_i(n)-y_j(n) \sum_{k=1}^{j} w_{ki}(n) y_k(n) \right)\]
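To make the connection between the two formulations explicit, here is a minimal numeric check (with made-up weights and a made-up sample, not part of the original code) showing that the combined update above is simply Oja’s rule applied to the deflated input \(\mathbf{x'}\):
set.seed(1)
x0   = runif(4)                                 # a made-up input sample
W0   = matrix(runif(3 * 4, -1, 1), nrow = 3)    # 3 components, 4 features
y0   = as.vector(W0 %*% x0)                     # y_j = w_j . x
j    = 3
eta0 = 0.01
# combined (Sanger) form: eta * y_j * (x - sum_{k<=j} y_k w_k)
dw_sanger = eta0 * y0[j] * (x0 - as.vector(y0[1:j] %*% W0[1:j, ]))
# deflation form: Oja's rule applied to x' = x - sum_{k<j} y_k w_k
x_defl = x0 - as.vector(y0[1:(j - 1)] %*% W0[1:(j - 1), ])
dw_oja = eta0 * y0[j] * (x_defl - y0[j] * W0[j, ])
all.equal(dw_sanger, dw_oja)                    # TRUE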
trainNetwork <- function(Xnorm, n_components, eta) {
  # initialize
  n_features = ncol(Xnorm)
  n_obs      = nrow(Xnorm)
  epochs     = 10^n_components
  z          = epochs * n_obs          # total number of single-sample updates
  W = matrix(data = runif(n_features * n_components, -1, 1),
             nrow = n_components,
             ncol = n_features)
  W = normalize(W)
  Y = matrix(data = 0, nrow = n_obs, ncol = n_components)
  # train: cycle through the observations one at a time
  for (n in 1:z) {
    ind = n %% n_obs + 1
    for (j in 1:n_components) {
      # output of the j-th neuron for the current sample
      Y[ind, j] = W[j, ] %*% Xnorm[ind, ]
      # generalized Hebbian (Sanger) update for the j-th weight vector
      W[j, ] = W[j, ] + eta * (Y[ind, j] * Xnorm[ind, ] -
                               Y[ind, j] * (Y[ind, 1:j] %*% W[1:j, , drop = FALSE]))
    }
  }
  list(data = Y, pc = W)
}
Now all we have to do is train our simple network. We will use the learning rate \(\eta = \frac{1}{\text{number of observations}} \approx 0.0066667\). Ideally we should have used a decaying schedule of the form \[\eta(n) = \frac{\text{const}_1}{n+\text{const}_2}\] which satisfies the usual conditions \[\lim_{n \to \infty} \eta(n) = 0\] \[\sum_{n=0}^{\infty} \eta(n) = \infty\] However, finding good constants can be quite time consuming, so here we simply keep the learning rate constant, which already gives rather good convergence. We will try both versions of data pre-processing: the first experiment uses merely centered data, the second uses data rescaled to the (-1, 1) interval.
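For reference, such a decaying schedule could look like the hypothetical helper below (the function name and constants are illustrative only and would need tuning); in what follows we stick with a constant \(\eta\).
# hypothetical decaying learning-rate schedule: eta(n) = c1 / (n + c2)
decayingEta <- function(n, c1 = 1, c2 = 100) {
  c1 / (n + c2)
}
# inside the training loop one would use eta_n = decayingEta(n) instead of a fixed eta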
eta = 1/n_obs
# use centered data
results = trainNetwork(Xnorm,4,eta)
# use normalized data
Xnorm2 = encode(Xnorm)
results2 = trainNetwork(Xnorm2,4,eta)
W = results$pc
W2 = results2$pc
Y = results$data
Y2 = results2$data
And we obtain the following principal components:
W
## [,1] [,2] [,3] [,4]
## [1,] 0.3868362 -0.03370618 0.8417560 0.37670935
## [2,] 0.6365878 0.71272462 -0.1696528 -0.05774605
## [3,] 0.5986174 -0.54297189 -0.1008112 -0.54043736
## [4,] -0.2621777 0.36708316 0.4533275 -0.70828539
W2
## [,1] [,2] [,3] [,4]
## [1,] -0.4349662 0.1248902 -0.61242855 -0.64859483
## [2,] -0.4298506 -0.8961971 0.09053084 0.03808553
## [3,] 0.7020452 -0.3530506 0.06045431 -0.61216302
## [4,] -0.3503979 0.2245385 0.78301676 -0.45535006
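A quick sanity check (not part of the original write-up): at convergence the rows of W should be approximately orthonormal, since they estimate orthogonal unit-length principal directions.
# row norms should be close to 1 and the Gram matrix roughly the identity
apply(W, 1, function(w) sqrt(sum(w^2)))
round(W %*% t(W), 2)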
Reconstructing the (preprocessed) data from the extracted components is as simple as:
\[\mathbf{\hat{X}}=\mathbf{Y}\mathbf{W}\]
Xhat = Y%*%W
Xhat2 = Y2%*%W2
We can now evaluate the difference between the original dataset and the reconstructed one. We’ll use the matrix 2-norm (the spectral norm, which is what norm(..., type = "2") computes in R):
error = Xnorm-Xhat
error2 = Xnorm2-Xhat2
norm(error, type="2")
## [1] 1.878206
norm(error2, type="2")
## [1] 0.1473655
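The absolute numbers are hard to interpret on their own; a quick way to put them on a common scale (an addition to the original analysis) is to divide by the norm of the corresponding data matrix.
# relative reconstruction error for each preprocessing variant
norm(error,  type = "2") / norm(Xnorm,  type = "2")
norm(error2, type = "2") / norm(Xnorm2, type = "2")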
It is interesting to note that the rescaled version shows better reconstruction than the merely centered one.
The last step is to compare our results to the built-in batch PCA (prcomp). Let’s calculate it for both types of preprocessing:
tPCA1 = t(prcomp(Xnorm)$rotation)
tPCA2 = t(prcomp(Xnorm2)$rotation)
tPCA1
## [,1] [,2] [,3] [,4]
## PC1 0.3613866 -0.08452251 0.85667061 0.35828920
## PC2 -0.6565888 -0.73016143 0.17337266 0.07548102
## PC3 0.5820299 -0.59791083 -0.07623608 -0.54583143
## PC4 0.3154872 -0.31972310 -0.47983899 0.75365743
tPCA2
## [,1] [,2] [,3] [,4]
## PC1 0.4249421 -0.1507482 0.61626702 0.645688876
## PC2 -0.4232027 -0.9039671 0.06038308 0.009839255
## PC3 0.7135724 -0.3363160 0.06590030 -0.611034506
## PC4 0.3621300 -0.2168178 -0.78244872 0.457849207
Let’s compare the difference between the obtained components, using the matrix 2-norm again.
norm(tPCA1-W,type="2")
## [1] 1.994868
norm(tPCA2-W2,type="2")
## [1] 2.001197
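These norms look large mostly because each principal component is determined only up to its sign (compare, for instance, the second rows of W and tPCA1 above). A fairer comparison (an extra step, not in the original) flips the learned rows to match prcomp’s signs first:
# align the sign of each learned component with the corresponding prcomp component
signs  = sign(diag(W %*% t(tPCA1)))   # +1 or -1 per component
W_flip = W * signs                    # flips row i when signs[i] == -1
norm(tPCA1 - W_flip, type = "2")      # should now be much smaller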
As we can see, the two preprocessing variants agree with batch PCA almost equally well, and once the sign ambiguity is accounted for, the learned components are nearly identical to the batch ones. The first approach (centering only) is nevertheless more practical: it requires fewer operations and works better in an online setting, where the mean can be maintained as a moving average over the stream.
For small datasets, classical PCA based on the Karhunen–Loève transformation is preferable simply because it is faster; GHA takes a long time to converge. For online learning, however, GHA wins because it can process a data stream. One improvement can be suggested to this implementation: instead of initializing the weights with random numbers, we can take a small initial batch of observations (say, the first k, where k is the desired number of components), perform classical PCA on it, and then run the online version, which keeps correcting the synaptic weights as more examples arrive.
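A minimal sketch of that suggestion (the function name and batch size are hypothetical, and this is not part of the code above): classical PCA on an initial batch provides the starting weights, and the same per-sample GHA update then refines them as further observations arrive.
warmStartGHA <- function(Xnorm, n_components, eta, batch_size = 20) {
  # batch phase: classical PCA on the first batch_size observations
  W = t(prcomp(Xnorm[1:batch_size, ])$rotation)[1:n_components, , drop = FALSE]
  # online phase: a single GHA pass over the remaining observations
  for (ind in (batch_size + 1):nrow(Xnorm)) {
    y = numeric(n_components)
    for (j in 1:n_components) {
      y[j]   = W[j, ] %*% Xnorm[ind, ]
      W[j, ] = W[j, ] + eta * y[j] *
               (Xnorm[ind, ] - as.vector(y[1:j] %*% W[1:j, , drop = FALSE]))
    }
  }
  W
}
# e.g. W_warm = warmStartGHA(Xnorm, 4, eta)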