We will understand the methods of factor analysis using the Boston housing data set.


Boston Housing Data

The Boston housing data contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It has been used extensively throughout the literature to benchmark algorithms. The dataset has 506 cases and 14 covariates, which are:

  1. Crime - per capita crime rate by town
  2. #Large lots - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. Nonretail acres - proportion of non-retail business acres per town.
  4. Chas - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. Nitric oxides - nitric oxides concentration (parts per 10 million)
  6. Rooms - average number of rooms per dwelling
  7. #Old houses - proportion of owner-occupied units built prior to 1940
  8. Dist.empl.centers - weighted distances to five Boston employment centres
  9. Access to highway - index of accessibility to radial highways
  10. Tax - full-value property-tax rate per $10,000
  11. Pupil/teacher - pupil-teacher ratio by town
  12. African American - \(1000\times (B_{k} - 0.63)^2\) where \(B_k\) is the proportion of blacks by town
  13. # Lower status - % lower status of the population
  14. Value - Median value of owner-occupied homes in $1000’s

We have previously worked with Boston housing data in the context of PCA and regression.

In the following analysis I drop the variable “Chas”, as it is binary. Further, the data is centered.

library("mlbench")
data(BostonHousing)
X=as.matrix(BostonHousing[,setdiff(1:ncol(BostonHousing),4)])
X=apply(X,2,function(u){return((u-mean(u)))})
#X=apply(X,2,function(u){return((u-mean(u))/sd(u))})
colnames(X)=c("Crime","#Large lots","Nonretail acres","Nitric oxides","Rooms",
              "#Old houses","Dist.empl.centers","Access to highway","Tax","Pupil/teacher",
              "African American","# Lower status","Value")
n=nrow(X); p=ncol(X)
#head(X)


We will apply factor analysis to this data set. To that end, we first find the number of factors using principal factor analysis (PFA). The steps are:

  1. First, find the communalities of the variables; the \(i\)-th communality is estimated by \(h_{i}=\big(\max_{j\neq i} r_{i,j}\big)^{2}\), the square of the largest correlation of the \(i\)-th variable with any other variable.


R=cor(X)
# h_i: square of the largest correlation of variable i with any other variable
communality=apply(R,1,function(u){(max(u[u!=1]))^2})
ind_var=1-communality   # individual (specific) variance 1 - h_i
table=data.frame(cbind(communality,ind_var))
colnames(table)=c("Communality","Individual variance")
rownames(table)=colnames(X)
library(knitr)
kable(table,caption = "Communality and individual variance")
Communality and individual variance

|                   | Communality | Individual variance |
|-------------------|------------:|--------------------:|
| Crime             |   0.3912567 |           0.6087433 |
| #Large lots       |   0.4414383 |           0.5585617 |
| Nonretail acres   |   0.5831635 |           0.4168365 |
| Nitric oxides     |   0.5831635 |           0.4168365 |
| Rooms             |   0.4835255 |           0.5164745 |
| #Old houses       |   0.5350485 |           0.4649515 |
| Dist.empl.centers |   0.4414383 |           0.5585617 |
| Access to highway |   0.8285154 |           0.1714846 |
| Tax               |   0.8285154 |           0.1714846 |
| Pupil/teacher     |   0.2159844 |           0.7840156 |
| African American  |   0.1111961 |           0.8888039 |
| # Lower status    |   0.3645741 |           0.6354259 |
| Value             |   0.4835255 |           0.5164745 |

The communalities \(h_i\), \(i=1, \ldots, p\), give an idea of how much the \(i\)-th variable can be explained by the common factors. If \(h_i\) is small for some \(i\) (see, for example, “African American”), then the \(i\)-th variable has a high individual variance (\(\Psi_{i,i}\)). This indicates that a large proportion of the total variability of \(X_{i}\) cannot be explained by the other variables.


  2. Next, set \(\Psi_{i,i}=1-h_{i}\), and calculate the reduced correlation matrix \(R-\Psi\).


  3. We find the spectral decomposition of \(R-\Psi\), and choose \(k=k_{0}\) such that the first \(k_{0}\) eigenvalues of \(R-\Psi\) are large and the remaining ones are close to zero.
psi_hat=diag(ind_var)       # initial Psi with 1 - h_i on the diagonal
reduced_corr=R-psi_hat      # reduced correlation matrix R - Psi
psi=psi_hat
SD=eigen(reduced_corr)      # spectral decomposition
plot(SD$values,type="b",lwd=4,cex=.5,xlab=" ",ylab="Eigenvalues",main="Scree plot")

k_opt=3

From the scree plot it seems reasonable to choose \(k_{0}=3\).


  4. Next, \(\Psi\) and \(\Lambda\) are estimated by maximum likelihood, given \(k_{0}=3\). The steps are given below:
  1. Set an initial estimate of \(\Psi\). Here I set \(\Psi_{i,i}=(1-h_{i})s_{i,i}\).

  2. Calculate \(S_{n}^{\star}=\Psi^{-1/2}S_{n} \Psi^{-1/2}\).

  3. Get the eigen-decomposition of \(S_{n}^{\star}\). Let \(\theta_{1}, \ldots, \theta_{p}\) be the eigenvalues, and \(\boldsymbol{\gamma}_{1}, \ldots, \boldsymbol{\gamma}_{p}\) be the corresponding eigenvectors of \(S_{n}^{\star}\).

  4. Then set \(c_{j}^{\star}=\sqrt{\max\{\theta_{j}-1,\, 0\}}\), and \(\boldsymbol{\lambda}_{(j)}=c_{j}^{\star}\, \Psi^{1/2}\boldsymbol{\gamma}_{j}\), \(j=1,\ldots,k_{0}\). Finally, \(\Lambda=\left[ \boldsymbol{\lambda}_{(1)}, \ldots, \boldsymbol{\lambda}_{(k_{0})}\right]\).

  5. Update \(\Psi_{i,i}=s_{i,i}-\boldsymbol{\lambda}_{i}^{\prime} \boldsymbol{\lambda}_{i}\), where \(\boldsymbol{\lambda}_{i}\) is the \(i\)-th row of \(\Lambda\).

  6. Calculate \(\epsilon=\max\{\|\Lambda_{\mathrm{old}}-\Lambda\|_{F}^2,\|\Psi_{\mathrm{old}}-\Psi\|_{F}^2 \}\).

  7. Repeat steps 2-6, until \(\epsilon<0.0001\).

S=cov(X)   # sample covariance matrix S_n
# Step 1: initial estimate Psi_ii = (1 - h_i) * s_ii
psi=diag((1-communality)*diag(S))
psi_half=diag(sqrt(diag(psi)))
psi_half_inv=diag(1/sqrt(diag(psi)))
# Step 2: S_n* = Psi^{-1/2} S_n Psi^{-1/2}
A=psi_half_inv%*%S%*%psi_half_inv
# Step 3: eigen-decomposition of S_n*
SD=eigen(A)
# Step 4: c_j* = sqrt(max(theta_j - 1, 0)); Lambda = Psi^{1/2} [gamma_j c_j*]
Gamma=pmax(SD$values[1:k_opt]-1,0)
Lambda=psi_half%*%SD$vectors[,1:k_opt]%*%diag(sqrt(Gamma))
# Steps 2-6 are repeated until convergence
epsilon=10
count=1
while(epsilon>0.0001)
{
  Lambda_old=Lambda
  psi_old=psi
  # Step 5: update Psi_ii = s_ii - lambda_i' lambda_i (kept positive)
  temp=diag(S)-rowSums(Lambda^2)
  idx=which(temp>0)
  if(length(idx)>0)
  {
    diag(psi)[idx]=temp[idx]
  }
  psi_half=diag(sqrt(diag(psi)))
  psi_half_inv=diag(1/sqrt(diag(psi)))
  A=psi_half_inv%*%S%*%psi_half_inv
  SD=eigen(A)
  Gamma=pmax(SD$values[1:k_opt]-1,0)
  Lambda=psi_half%*%SD$vectors[,1:k_opt]%*%diag(sqrt(Gamma))
  # Step 6: maximum change in Lambda and Psi
  epsilon1=sum((Lambda-Lambda_old)^2)
  epsilon2=sum((diag(psi)-diag(psi_old))^2)
  epsilon=max(epsilon1,epsilon2)
  count=count+1
}
print(paste0("# iterations=", count, "  epsilon=", epsilon))
## [1] "# iterations=301  epsilon=9.63773310024254e-05"


  5. Estimated loadings and individual specific variances:

The estimated factor loading matrix and the individual specific variances are given below. Note that the variables are not standardized, so after \(\Lambda\) is estimated, the loading matrix is made scale-free by dividing its \(i\)-th row by \(\mathrm{sd}(X_{i})\), and \(\Psi_{i,i}\) is normalized by \(\mathrm{var}(X_{i})\). The scaled value of \(\lambda_{i,j}\) is then the Pearson correlation between the \(i\)-th covariate and the \(j\)-th factor.

s=apply(X,2,sd)
Lambda=diag(1/s)%*%Lambda #After scale adjustment
rownames(Lambda)=colnames(X)
colnames(Lambda)=c("Factor1","Factor2","Factor3")
Lambda_psi=cbind(Lambda,round((diag(psi)/s^2),digits=6))
colnames(Lambda_psi)=c("Factor1","Factor2","Factor3","Individual specific var")
print(Lambda_psi)
##                      Factor1      Factor2     Factor3  Individual specific var
## Crime              0.6150139 -0.038102548  0.14703509                 0.598687
## #Large lots       -0.5149653 -0.096784276  0.48664604                 0.488619
## Nonretail acres    0.8164886 -0.014042558 -0.23624258                 0.277339
## Nitric oxides      0.7994978 -0.085086226 -0.34212109                 0.236517
## Rooms             -0.4460869 -0.599039100  0.01402339                 0.441962
## #Old houses        0.6794389 -0.017111033 -0.47403002                 0.313366
## Dist.empl.centers -0.6998130  0.200792188  0.58516947                 0.127521
## Access to highway  0.8592198 -0.271692168  0.28198695                 0.108408
## Tax                0.9160949 -0.199736945  0.24623620                 0.060243
## Pupil/teacher      0.5079170  0.216860083  0.17385190                 0.664767
## African American  -0.4780229 -0.006686592 -0.07746935                 0.765448
## # Lower status     0.7268392  0.411568421 -0.12803369                 0.285924
## Value             -0.6387247 -0.699881107 -0.09262477                 0.093618


  6. Varimax rotation:

The scaled loading matrix \(\Lambda\) is not yet interpretable: since all the variables have an absolute correlation of at least \(0.44\) with the first factor, we cannot attach a meaning to the first factor, and similarly for the other factors. Therefore, we rotate \(\Lambda\) using varimax rotation, as sketched below.
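The rotation can be computed with the base-R function `varimax`. The following is a minimal sketch: `normalize=FALSE` is an assumption (the text does not state whether Kaiser normalization was used), and `VM_loading_ML` is the name the plotting code further below uses for the rotated loadings.

vm=varimax(Lambda,normalize=FALSE)       # Lambda: scaled ML loadings from above
T_opt=vm$rotmat                          # optimal rotation matrix
VM_loading_ML=Lambda%*%T_opt             # rotated loadings, Lambda* = Lambda T
dimnames(VM_loading_ML)=dimnames(Lambda)
round(T_opt,digits=3)
round(VM_loading_ML,digits=3)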

The optimal rotation matrix obtained by the varimax technique is the following:

Optimal rotation matrix

| Col 1  | Col 2 | Col 3  |
|-------:|------:|-------:|
|  0.749 | 0.395 | -0.531 |
| -0.379 | 0.914 |  0.146 |
|  0.543 | 0.092 |  0.835 |

Further, the rotated loading matrix \(\Lambda^{\star}\) is shown below. Note that I have rearranged the rows of the rotated \(\Lambda\) to facilitate comparison.
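The table, including the communality column (the row sums of squared rotated loadings), can be assembled as follows. This is a sketch assuming `VM_loading_ML` from the step above, with the row order chosen by hand.

tab=cbind(VM_loading_ML,Communality=rowSums(VM_loading_ML^2))
ord=c("Access to highway","Tax","Crime","# Lower status","Value","Rooms",
      "Nonretail acres","Nitric oxides","Dist.empl.centers","#Old houses",
      "#Large lots","Pupil/teacher","African American")
kable(round(tab[ord,],digits=3),caption="Rotated loading matrix")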

Rotated loading matrix

|                   | Factor1 | Factor2 | Factor3 | Communality |
|-------------------|--------:|--------:|--------:|------------:|
| Access to highway |   0.900 |   0.118 |  -0.261 |       0.892 |
| Tax               |   0.896 |   0.202 |  -0.310 |       0.940 |
| Crime             |   0.555 |   0.222 |  -0.210 |       0.401 |
| # Lower status    |   0.319 |   0.652 |  -0.433 |       0.714 |
| Value             |  -0.264 |  -0.901 |   0.160 |       0.906 |
| Rooms             |  -0.100 |  -0.723 |   0.161 |       0.558 |
| Nonretail acres   |   0.489 |   0.288 |  -0.633 |       0.723 |
| Nitric oxides     |   0.446 |   0.207 |  -0.723 |       0.763 |
| Dist.empl.centers |  -0.283 |  -0.039 |   0.889 |       0.872 |
| #Old houses       |   0.258 |   0.209 |  -0.759 |       0.687 |
| #Large lots       |  -0.085 |  -0.247 |   0.666 |       0.511 |
| Pupil/teacher     |   0.393 |   0.415 |  -0.093 |       0.335 |
| African American  |  -0.398 |  -0.202 |   0.188 |       0.235 |


  7. Interpretation:

Observe that “Access to highway”, “Tax” and “Crime” have high correlations with Factor 1;

“# Lower status”, “Value” and “Rooms” have high correlations with Factor 2; and

“Nonretail acres”, “Nitric oxides”, “Dist.empl.centers” and “#Old houses” have high correlations with Factor 3.

Based on these loadings, we can term Factor 1 the “convenience” (or “quality of life”) factor, Factor 2 the “economy” factor, and Factor 3 the “society” factor.

Note that the other two variables, “African American” and “Pupil/teacher”, do not have a high correlation with any of the factors. This is because these variables have low communalities, and are therefore not well explained by the common factors.


  8. Factor rotation graphically:

It is interesting to see the effect of the factor rotation graphically. In the following figure the axes of rotation are shown in green, the original points in blue, and the rotated points in red.

The plot for factors 1 and 2 is given below.
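A minimal sketch of how such a plot can be drawn, assuming `Lambda`, `T_opt` and `VM_loading_ML` from the varimax step above:

# blue: original loadings; red: rotated loadings; green: rotated axes
plot(Lambda[,1],Lambda[,2],col="blue",pch=19,xlim=c(-1,1),ylim=c(-1,1),
     xlab="Factor 1",ylab="Factor 2",main="Varimax rotation: factors 1 and 2")
points(VM_loading_ML[,1],VM_loading_ML[,2],col="red",pch=19)
abline(h=0,v=0,lty=3)
# the first two columns of T_opt give the directions of the rotated axes
arrows(0,0,T_opt[1,1],T_opt[2,1],col="green",lwd=2)
arrows(0,0,T_opt[1,2],T_opt[2,2],col="green",lwd=2)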

Observe that after varimax rotation, the loadings have either become large (in absolute value) or close to zero.


  9. Graphical representation of the factor loadings:

The graphical representations of the factor loadings are displayed below. From this graph one can get an idea of the group structure among the covariates. For example, “Nonretail acres”, “Nitric oxides” and “#Old houses” seem to form a group with respect to all three factors.


par(mar=c(4,4,4,4))
lab=rownames(VM_loading_ML)
# Factor 1 vs Factor 2
plot(VM_loading_ML[,1],VM_loading_ML[,2],cex=.2,lwd=5,xlab="Factor 1",ylab="Factor 2")
for(i in 1:nrow(VM_loading_ML)){ text(VM_loading_ML[i,1],VM_loading_ML[i,2],lab[i]) }
# Factor 1 vs Factor 3
plot(VM_loading_ML[,1],VM_loading_ML[,3],cex=.2,lwd=5,xlab="Factor 1",ylab="Factor 3")
for(i in 1:nrow(VM_loading_ML)){ text(VM_loading_ML[i,1],VM_loading_ML[i,3],lab[i]) }
# Factor 2 vs Factor 3
plot(VM_loading_ML[,2],VM_loading_ML[,3],cex=.2,lwd=5,xlab="Factor 2",ylab="Factor 3")
for(i in 1:nrow(VM_loading_ML)){ text(VM_loading_ML[i,2],VM_loading_ML[i,3],lab[i]) }


The associations among the variables can also be viewed in a 3-factor coordinate system.
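One way to draw such a 3D view is sketched below; the `scatterplot3d` package is an assumption of this sketch, not necessarily what was used originally.

library(scatterplot3d)   # assumed package choice for the 3D view
s3d=scatterplot3d(VM_loading_ML[,1],VM_loading_ML[,2],VM_loading_ML[,3],
                  pch=19,xlab="Factor 1",ylab="Factor 2",zlab="Factor 3")
# project the 3D coordinates onto the plotting plane to place the labels
xy=s3d$xyz.convert(VM_loading_ML[,1],VM_loading_ML[,2],VM_loading_ML[,3])
text(xy$x,xy$y,labels=rownames(VM_loading_ML),pos=3,cex=.7)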




  10. *How sensitive is this result to the method of estimation?*

It is also interesting to compare the results obtained by ML estimation with those obtained by PFA. Will the factors change when PFA is applied instead of ML?

Below we estimate the loadings using PFA and compare the varimax-rotated factors with those obtained by ML estimation.

Steps of PFA are given below:

(a) Set an initial estimate of $\Psi$. Here I set $\Psi_{i,i}=(1-h_{i})$.

(b) Calculate the reduced correlation matrix $R-\Psi$.

(c) Calculate $\Lambda$ from the eigen decomposition of $R-\Psi$.

(d) Update $\Psi=I-\mathrm{Diag}(\Lambda\Lambda^{\prime})$.

(e) Calculate $\epsilon=\max\{\|\Lambda_{\mathrm{old}}-\Lambda\|_{F}^2,\|\Psi_{\mathrm{old}}-\Psi\|_{F}^2 \}$.

(f) Repeat steps (b)-(e), until $\epsilon<0.0001$.

R=cor(X)
# 1. Compute communalities
communality=apply(R,1,function(u){(max(u[u!=1]))^2})
# 2. Initial estimate Psi_ii = 1 - h_i
psi_hat=diag(1-communality)
reduced_corr=R-psi_hat
psi=psi_hat
SD=eigen(reduced_corr)
# 3. Choose k; from the scree plot, k = 3
k_opt=3
# 4. Compute Lambda from the eigen-decomposition of R - Psi
Lambda=SD$vectors[,1:k_opt]%*%sqrt(diag(SD$values[1:k_opt]))
# Steps (b)-(e) are repeated until convergence
epsilon=10
count=1
while(epsilon>0.0001)
{
  Lambda_old=Lambda
  psi_old=psi
  temp=1-rowSums(Lambda^2)   # update Psi_ii = 1 - lambda_i' lambda_i (kept positive)
  idx=which(temp>0)
  if(length(idx)>0)
  {
    diag(psi)[idx]=temp[idx]
  }
  reduced_corr=R-psi
  SD=eigen(reduced_corr)
  Lambda=SD$vectors[,1:k_opt]%*%sqrt(diag(SD$values[1:k_opt]))
  epsilon1=sum((Lambda-Lambda_old)^2)
  epsilon2=sum((diag(psi)-diag(psi_old))^2)
  epsilon=max(epsilon1,epsilon2)
  count=count+1
}
print(paste0("# iterations=", count, "  epsilon=", epsilon))

We next apply varimax rotation on the estimated factor loading matrix.
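A sketch of this step, assuming `Lambda` now holds the PFA loadings from the loop above (since PFA works on the correlation matrix, no scale adjustment is needed); `VM_loading_PFA` is a name introduced here for the comparison at the end.

vm_pfa=varimax(Lambda,normalize=FALSE)
VM_loading_PFA=Lambda%*%vm_pfa$rotmat   # rotated PFA loadings
rownames(VM_loading_PFA)=colnames(X)
colnames(VM_loading_PFA)=c("Factor1","Factor2","Factor3")
round(cbind(VM_loading_PFA,Communality=rowSums(VM_loading_PFA^2)),digits=3)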

Rotated loading matrix

|                   | Factor 1 | Factor 2 | Factor 3 | Communality |
|-------------------|---------:|---------:|---------:|------------:|
| Access to highway |    0.900 |    0.118 |   -0.261 |       0.892 |
| Tax               |    0.896 |    0.202 |   -0.310 |       0.940 |
| Crime             |    0.555 |    0.222 |   -0.210 |       0.401 |
| # Lower status    |    0.319 |    0.652 |   -0.433 |       0.714 |
| Value             |   -0.264 |   -0.901 |    0.160 |       0.906 |
| Rooms             |   -0.100 |   -0.723 |    0.161 |       0.558 |
| Nonretail acres   |    0.489 |    0.288 |   -0.633 |       0.723 |
| Nitric oxides     |    0.446 |    0.207 |   -0.723 |       0.763 |
| Dist.empl.centers |   -0.283 |   -0.039 |    0.889 |       0.872 |
| #Old houses       |    0.258 |    0.209 |   -0.759 |       0.687 |
| #Large lots       |   -0.085 |   -0.247 |    0.666 |       0.511 |
| Pupil/teacher     |    0.393 |    0.415 |   -0.093 |       0.335 |
| African American  |   -0.398 |   -0.202 |    0.188 |       0.235 |


  11. Similarities and differences:

Observe that the correlations of “Crime”, “Tax” and “Access to highway” are now high with Factor 3, while “Nonretail acres”, “Nitric oxides”, “Dist.empl.centers” and “#Old houses” have high correlations with Factor 1.

Therefore, Factors 1, 2 and 3 are the “society”, “economy” and “convenience” factors here. The individual loadings also differ between the two methods. However, the groups formed by the variables remain the same in both cases, so the identities of the latent factors are stable across the two methods of estimation. This correspondence can also be checked numerically, as sketched below.
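A quick numerical check (a sketch using `VM_loading_ML` and `VM_loading_PFA` from above): correlate the rotated loading columns of the two solutions; entries with large absolute value identify matching factors.

# rows: ML factors; columns: PFA factors
round(cor(VM_loading_ML,VM_loading_PFA),digits=2)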
