We will illustrate the methods of factor analysis using the Boston housing data set.
The Boston housing data set contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It has been used extensively throughout the literature to benchmark algorithms, and it has 506 cases and 14 covariates.
We have previously worked with Boston housing data in the context of PCA and regression.
In the following analysis I drop the variable “chas” (the binary Charles River indicator). Further, the data are centered.
library("mlbench")
data(BostonHousing)
X=as.matrix(BostonHousing[,setdiff(1:ncol(BostonHousing),4)]) # drop column 4 ("chas"), the binary Charles River indicator
X=apply(X,2,function(u){return((u-mean(u)))}) # center each column
#X=apply(X,2,function(u){return((u-mean(u))/sd(u))}) # (standardization; not used here)
colnames(X)=c("Crime","#Large lots","Nonretail acres","Nitric oxides","Rooms",
"#Old houses","Dist.empl.centers","Access to highway","Tax","Pupil/teacher",
"African American","# Lower status","Value")
n=nrow(X); p=ncol(X)
#head(X)
We will apply factor analysis to this data set. Toward that, we first find the number of factors using principal factor analysis (PFA). The first step is to estimate the communality of each variable from the correlation matrix:
R=cor(X)
communality=apply(R,1,function(u){(max(u[u!=1]))^2}) # initial communality: squared largest off-diagonal correlation of each variable
ind_var=1-communality
table=data.frame(cbind(communality,ind_var))
colnames(table)=c("Communality","Individual variance")
rownames(table)=colnames(X)
library(knitr)
kable(table,caption = "Communality and individual variance")
| | Communality | Individual variance |
|---|---|---|
| Crime | 0.3912567 | 0.6087433 |
| #Large lots | 0.4414383 | 0.5585617 |
| Nonretail acres | 0.5831635 | 0.4168365 |
| Nitric oxides | 0.5831635 | 0.4168365 |
| Rooms | 0.4835255 | 0.5164745 |
| #Old houses | 0.5350485 | 0.4649515 |
| Dist.empl.centers | 0.4414383 | 0.5585617 |
| Access to highway | 0.8285154 | 0.1714846 |
| Tax | 0.8285154 | 0.1714846 |
| Pupil/teacher | 0.2159844 | 0.7840156 |
| African American | 0.1111961 | 0.8888039 |
| # Lower status | 0.3645741 | 0.6354259 |
| Value | 0.4835255 | 0.5164745 |
The communalities \(h_i\), \(i=1, \ldots, p\), give an idea of how much the \(i\)-th variable can be explained by the common factors. If \(h_i\) is small for some \(i\) (see, for example, “African American”), then the \(i\)-th variable has a high individual variance (\(\Psi_{i,i}\)). This indicates that a larger proportion of the total variability of \(X_{i}\) cannot be explained by the common factors.
Next, we form the reduced correlation matrix \(R-\Psi\) and examine the scree plot of its eigenvalues to choose the number of factors:
psi_hat=diag(ind_var) # initial estimate of Psi (individual variances)
reduced_corr=R-psi_hat # reduced correlation matrix R - Psi
psi=psi_hat
SD=eigen(reduced_corr) # eigen-decomposition of the reduced correlation matrix
plot(SD$values,type="b",lwd=4,cex=.5,xlab=" ",ylab="Eigenvalues",main="Scree plot")
k_opt=3 # number of factors suggested by the scree plot
From the scree plot it seems reasonable to choose \(k_{0}=3\).
Having fixed \(k_{0}\), we estimate \(\Lambda\) and \(\Psi\) iteratively using the following steps:

1. Set an initial estimate of \(\Psi\). Here I set \(\Psi_{i,i}=(1-h_{i})s_{i,i}\).
2. Calculate \(S_{n}^{\star}=\Psi^{-1/2}S_{n} \Psi^{-1/2}\).
3. Get the eigen-decomposition of \(S_{n}^{\star}\). Let \(\theta_{1}, \ldots, \theta_{p}\) be the eigenvalues, and \(\boldsymbol{\gamma}_{1}, \ldots, \boldsymbol{\gamma}_{p}\) be the corresponding eigenvectors of \(S_{n}^{\star}\).
4. Set \(c_{j}^{\star}=\sqrt{\max\{\theta_{j}-1,\,0\}}\), and \(\boldsymbol{\lambda}_{(j)}=c_{j}^{\star} \Psi^{1/2}\boldsymbol{\gamma}_{j}\), \(j=1,\ldots,k_{0}\). Finally, \(\Lambda=\left[ \boldsymbol{\lambda}_{(1)}, \ldots, \boldsymbol{\lambda}_{(k_{0})}\right]\).
5. Update \(\Psi_{i,i}=s_{i,i}-\boldsymbol{\lambda}_{i}^{\prime} \boldsymbol{\lambda}_{i}\), where \(\boldsymbol{\lambda}_{i}\) is the \(i\)-th row of \(\Lambda\), and calculate \(\epsilon=\max\{\|\Lambda_{\mathrm{old}}-\Lambda\|_{F}^2,\|\Psi_{\mathrm{old}}-\Psi\|_{F}^2 \}\).
6. Repeat steps 2-5 until \(\epsilon<0.0001\).
S=cov(X) #Compute Sn
print("1. Set an initial estimate of Psi")
## [1] "1. Set an initial estimate of Psi"
psi=diag(1-communality)*S # 1. Estimate Psi=Diag((1-h_i)*si)
psi_half=diag(sqrt(diag(psi)))
psi_half_inv=diag(1/sqrt(diag(psi)))
#print("2. Calculate A")
A=(psi_half_inv%*%S%*%psi_half_inv) # 2. Calculate A
SD=eigen(A)
#print("3. Calculate Gamma")
Gamma=SD$values[1:k_opt]-1 # 3. Calculate Gamma
Gamma[which(Gamma<0)]=0
print("2. Calculate Lambda")
## [1] "2. Calculate Lambda"
Lambda=psi_half%*%SD$vectors[,1:k_opt]%*%sqrt(diag(Gamma)) # 4. Calculate Lambda
# 5. Iterate until convergence
print("3. Update Psi")
## [1] "3. Update Psi"
print("4. Iterate steps 2-3 until the algorithm converges")
## [1] "4. Iterate steps 2-3 until the algorithm converges"
epsilon=10;
count=1
while(epsilon>0.0001)
{
Lambda_old=Lambda
psi_old=psi
temp=diag(S)-apply(Lambda,1,function(u){tcrossprod(t(u))}) # 6. Update psi (considering the fact that psi_ii>0)
idx=which(temp>0)
if(length(idx)>0)
{
diag(psi)[idx]=temp[idx]
}
psi_half=diag(sqrt(diag(psi)))
psi_half_inv=diag(1/sqrt(diag(psi)))
A=(psi_half_inv%*%S%*%psi_half_inv)
SD=eigen(A)
Gamma=SD$values[1:k_opt]-1
Gamma[which(Gamma<0)]=0
Lambda=psi_half%*%SD$vectors[,1:k_opt]%*%sqrt(diag(Gamma)) # 7. Update lambda
epsilon1=sum((Lambda-Lambda_old)^2)
epsilon2=sum((diag(psi)-diag(psi_old))^2)
epsilon=max(epsilon1,epsilon2) # 8. Calculate the difference in results
#print(paste0("epsilon=",epsilon))
count=count+1
}
print(paste0("# iterations=", count, " epsilon=", epsilon))
## [1] "# iterations=301 epsilon=9.63773310024254e-05"
The estimated factor loading matrix and the individual specific variances are given below. Note that the variables are not standardized. So, after estimating \(\Lambda\), the loading matrix is made scale-free by dividing its \(i\)-th row by \(\mathrm{sd}(X_{i})\), and \(\Psi_{i,i}\) is normalized by \(\mathrm{var}(X_{i})\). The scaled value of \(\lambda_{i,j}\) is then the Pearson correlation between the \(i\)-th covariate and the \(j\)-th factor.
s=apply(X,2,sd)
Lambda=diag(1/s)%*%Lambda #After scale adjustment
rownames(Lambda)=colnames(X)
colnames(Lambda)=c("Factor1","Factor2","Factor3")
Lambda_psi=cbind(Lambda,round((diag(psi)/s^2),digits=6))
colnames(Lambda_psi)=c("Factor1","Factor2","Factor3","Individual specific var")
print(Lambda_psi)
## Factor1 Factor2 Factor3 Individual specific var
## Crime 0.6150139 -0.038102548 0.14703509 0.598687
## #Large lots -0.5149653 -0.096784276 0.48664604 0.488619
## Nonretail acres 0.8164886 -0.014042558 -0.23624258 0.277339
## Nitric oxides 0.7994978 -0.085086226 -0.34212109 0.236517
## Rooms -0.4460869 -0.599039100 0.01402339 0.441962
## #Old houses 0.6794389 -0.017111033 -0.47403002 0.313366
## Dist.empl.centers -0.6998130 0.200792188 0.58516947 0.127521
## Access to highway 0.8592198 -0.271692168 0.28198695 0.108408
## Tax 0.9160949 -0.199736945 0.24623620 0.060243
## Pupil/teacher 0.5079170 0.216860083 0.17385190 0.664767
## African American -0.4780229 -0.006686592 -0.07746935 0.765448
## # Lower status 0.7268392 0.411568421 -0.12803369 0.285924
## Value -0.6387247 -0.699881107 -0.09262477 0.093618
The scaled loading matrix \(\Lambda\) is not readily interpretable at this stage. As all the variables have an absolute correlation of at least \(0.44\) with the first factor, we cannot interpret the first factor, and similarly for the other factors. Therefore, we rotate \(\Lambda\) using the varimax rotation.
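The rotation itself is computed in a code chunk that is not shown above; the following is a minimal sketch of this step using `varimax()` from base R's stats package, assuming `Lambda` holds the scale-adjusted loadings obtained above (the names `vm`, `Lambda_star` and `rot` are introduced here only for illustration, and the exact numbers may differ slightly depending on the normalization option):

```r
vm <- varimax(Lambda, normalize = FALSE) # varimax rotation of the 13 x 3 loading matrix
Lambda_star <- unclass(vm$loadings)      # rotated loading matrix Lambda*
rot <- vm$rotmat                         # 3 x 3 orthogonal rotation matrix
round(rot, 3)
```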
The optimal rotation matrix obtained by the varimax technique is the following:
| Col 1 | Col 2 | Col 3 |
|---|---|---|
| 0.749 | 0.395 | -0.531 |
| -0.379 | 0.914 | 0.146 |
| 0.543 | 0.092 | 0.835 |
Further, the rotated loading matrix \(\Lambda^{\star}\) is shown below. Note that I have rearranged the rows of the rotated \(\Lambda\) to facilitate comparison.
| | Factor1 | Factor2 | Factor3 | Communality |
|---|---|---|---|---|
| Access to highway | 0.900 | 0.118 | -0.261 | 0.892 |
| Tax | 0.896 | 0.202 | -0.310 | 0.940 |
| Crime | 0.555 | 0.222 | -0.210 | 0.401 |
| # Lower status | 0.319 | 0.652 | -0.433 | 0.714 |
| Value | -0.264 | -0.901 | 0.160 | 0.906 |
| Rooms | -0.100 | -0.723 | 0.161 | 0.558 |
| Nonretail acres | 0.489 | 0.288 | -0.633 | 0.723 |
| Nitric oxides | 0.446 | 0.207 | -0.723 | 0.763 |
| Dist.empl.centers | -0.283 | -0.039 | 0.889 | 0.872 |
| #Old houses | 0.258 | 0.209 | -0.759 | 0.687 |
| #Large lots | -0.085 | -0.247 | 0.666 | 0.511 |
| Pupil/teacher | 0.393 | 0.415 | -0.093 | 0.335 |
| African American | -0.398 | -0.202 | 0.188 | 0.235 |
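The communality column above can be recovered directly from the rotated loadings, since the row sums of squared loadings are unchanged by an orthogonal rotation; a minimal sketch, assuming `Lambda_star` is the rotated loading matrix from the sketch above:

```r
# communality of each variable = sum of its squared (rotated) loadings
communality_rot <- rowSums(Lambda_star^2)
round(communality_rot, 3)
```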
Observe that “Access to highway”, “Tax” and “Crime” have high correlations with Factor 1; “# Lower status”, “Value” and “Rooms” have high correlations with Factor 2; and “Nonretail acres”, “Nitric oxides”, “Dist.empl.centers” and “#Old houses” have high correlations with Factor 3.
Based on these loadings, we can term Factors 1, 2 and 3 the “quality of life” (or “convenience”) factor, the “economy” factor, and the “society” factor, respectively.
Note that the other two variables, “African American” and “Pupil/teacher”, do not have a high correlation with any of the factors. This is because these variables have low communality and are therefore not well explained by the common factors.
It is interesting to see the effect of the factor rotation graphically. In the following figure the axes of rotation are shown in green, the original points in blue, and the rotated points in red. The plot for factors 1 and 2 is given below. Observe that after the varimax rotation, the loadings have either become large (in absolute value) or close to zero.
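The figure is produced in a code chunk that is not shown; a minimal sketch of such a plot for factors 1 and 2, assuming `Lambda`, `Lambda_star` and `rot` are the unrotated loadings, rotated loadings and rotation matrix from the sketches above:

```r
# loadings on factors 1 and 2 before (blue) and after (red) varimax rotation
plot(Lambda[,1], Lambda[,2], col="blue", pch=19, xlim=c(-1,1), ylim=c(-1,1),
     xlab="Factor 1", ylab="Factor 2")
points(Lambda_star[,1], Lambda_star[,2], col="red", pch=19)
# rotated axis directions (projected onto the factor 1-2 plane) in green
arrows(0, 0, rot[1,1], rot[2,1], col="green", lwd=2)
arrows(0, 0, rot[1,2], rot[2,2], col="green", lwd=2)
abline(h=0, v=0, lty=3)
```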
The graphical representations of the factor loadings are displayed next. From these plots one can get an idea of the group structure among the covariates. For example, “Nonretail acres”, “Nitric oxides” and “#Old houses” seem to form a group with respect to all three factors.
```{r echo=FALSE, message=FALSE, warning=FALSE}
par(mar=c(4,4,4,4))
lab=rownames(VM_loading_ML)
# Factor 1 vs Factor 2
plot(VM_loading_ML[,1],VM_loading_ML[,2],cex=.2,lwd=5,xlab="Factor 1",ylab="Factor 2")
for(i in 1:ncol(X)) { text(VM_loading_ML[i,1],VM_loading_ML[i,2],lab[i]) }
# Factor 1 vs Factor 3
plot(VM_loading_ML[,1],VM_loading_ML[,3],cex=.2,lwd=5,xlab="Factor 1",ylab="Factor 3")
for(i in 1:ncol(X)) { text(VM_loading_ML[i,1],VM_loading_ML[i,3],lab[i]) }
# Factor 2 vs Factor 3
plot(VM_loading_ML[,2],VM_loading_ML[,3],cex=.2,lwd=5,xlab="Factor 2",ylab="Factor 3")
for(i in 1:ncol(X)) { text(VM_loading_ML[i,2],VM_loading_ML[i,3],lab[i]) }
```
The association of the different variables can also be viewed in a 3-factor coordinate system.
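One way to draw such a plot is sketched below, assuming the scatterplot3d package is available and that `VM_loading_ML` holds the varimax-rotated loadings, as in the plotting code above:

```r
library(scatterplot3d)
# each covariate plotted by its loadings on the three rotated factors
s3d <- scatterplot3d(VM_loading_ML[,1], VM_loading_ML[,2], VM_loading_ML[,3],
                     xlab="Factor 1", ylab="Factor 2", zlab="Factor 3", pch=19)
# label the points with the variable names
xy <- s3d$xyz.convert(VM_loading_ML[,1], VM_loading_ML[,2], VM_loading_ML[,3])
text(xy$x, xy$y, labels=rownames(VM_loading_ML), pos=3, cex=0.7)
```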
10. *How sensitive is this result to the method of estimation?*
It is also interesting to compare the results obtained by ML estimation with those obtained by PFA. Will the factors change when PFA is applied instead of ML?
Below we estimate the loadings using PFA and compare the varimax-rotated factors with those obtained by ML estimation.
Steps of PFA are given below:
(a) Set an initial estimate of $\Psi$. Here I set $\Psi_{i,i}=(1-h_{i})$.
(b) Calculate the reduced correlation matrix $R-\Psi$.
(c) Calculate $\Lambda$ from the eigen-decomposition of $R-\Psi$, using the leading $k_{0}$ eigenvalue-eigenvector pairs.
(d) Update $\Psi=I-\mathrm{Diag}(\Lambda\Lambda^{\prime})$.
(e) Calculate $\epsilon=\max\{\|\Lambda_{\mathrm{old}}-\Lambda\|_{F}^2,\|\Psi_{\mathrm{old}}-\Psi\|_{F}^2 \}$.
(f) Repeat steps (b)-(e) until $\epsilon<0.0001$.
```{r echo=FALSE, message=FALSE, warning=FALSE}
R=cor(X)
print("1. Compute communality")
communality=apply(R,1,function(u){(max(u[u!=1]))^2}) # 1. Compute communality
print("2. Set psi_i=1-communality_i")
psi_hat=diag(1-communality) # 2. Initial estimate of psi
reduced_corr=R-psi_hat
psi=psi_hat
SD=eigen(reduced_corr)
# We choose k=3
print("3. Estimate k")
k_opt=3 # 3. Estimate k
print("4. Compute Lambda")
Lambda=SD$vectors[,1:k_opt]%*%sqrt(diag(SD$values[1:k_opt])) # 4. Compute Lambda
# Iterate until convergence
print("5. Update Psi")
print("6. Repeat steps 4-5 until the algorithm converges")
epsilon=10; count=1
while(epsilon>0.0001)
{
  Lambda_old=Lambda
  psi_old=psi
  temp=1-apply(Lambda,1,function(u){tcrossprod(t(u))}) # update psi (keeping psi_ii>0)
  idx=which(temp>0)
  if(length(idx)>0)
  {
    diag(psi)[idx]=temp[idx]
  }
  reduced_corr=R-psi
  SD=eigen(reduced_corr)
  Lambda=SD$vectors[,1:k_opt]%*%sqrt(diag(SD$values[1:k_opt])) # update Lambda
  epsilon1=sum((Lambda-Lambda_old)^2)
  epsilon2=sum((diag(psi)-diag(psi_old))^2)
  epsilon=max(epsilon1,epsilon2) # difference between successive iterates
  #print(paste0("epsilon=",epsilon))
  count=count+1
}
print(paste0("# iterations=", count, " epsilon=", epsilon))
```
We next apply varimax rotation on the estimated factor loading matrix.
| | Factor 1 | Factor 2 | Factor 3 | Communality |
|---|---|---|---|---|
| Access to highway | 0.900 | 0.118 | -0.261 | 0.892 |
| Tax | 0.896 | 0.202 | -0.310 | 0.940 |
| Crime | 0.555 | 0.222 | -0.210 | 0.401 |
| # Lower status | 0.319 | 0.652 | -0.433 | 0.714 |
| Value | -0.264 | -0.901 | 0.160 | 0.906 |
| Rooms | -0.100 | -0.723 | 0.161 | 0.558 |
| Nonretail acres | 0.489 | 0.288 | -0.633 | 0.723 |
| Nitric oxides | 0.446 | 0.207 | -0.723 | 0.763 |
| Dist.empl.centers | -0.283 | -0.039 | 0.889 | 0.872 |
| #Old houses | 0.258 | 0.209 | -0.759 | 0.687 |
| #Large lots | -0.085 | -0.247 | 0.666 | 0.511 |
| Pupil/teacher | 0.393 | 0.415 | -0.093 | 0.335 |
| African American | -0.398 | -0.202 | 0.188 | 0.235 |
Observe that the correlations of “Crime”, “Tax” and “Access to highway” with Factor 3 are high here, while “Nonretail acres”, “Nitric oxides”, “Dist.empl.centers” and “#Old houses” have high correlations with Factor 1.
Therefore, Factors 1, 2 and 3 are the “society”, “economy” and “convenience” factors here. The individual loadings also differ. However, the groups formed by the variables remain the same in both cases. Thus, the identities of the latent factors remain the same across the different methods of estimation.