We analyze the data set (you can download the data from the “Data Sets” folder) from 1904 Spearman’s paper that may be regarded as the starting point of factor analysis. A bit of background first. Spearman was working with scores obtained in examinations, and noticed certain systematic effects in the matrix of correlations between scores. One of his examples, obtained from 33 “high-class preparatory school” children, was as follows:
# data covariance matrix
setwd("/Users/kevinpiger/Desktop")
sub <- read.table("spearman.txt")
sub
## Classics French English Mathematics Discrimination Music
## Classics 1.00 0.83 0.78 0.70 0.66 0.63
## French 0.83 1.00 0.67 0.67 0.65 0.57
## English 0.78 0.67 1.00 0.64 0.54 0.51
## Mathematics 0.70 0.67 0.64 1.00 0.45 0.51
## Discrimination 0.66 0.65 0.54 0.45 1.00 0.40
## Music 0.63 0.58 0.51 0.51 0.40 1.00
The principal factor analysis method dates from the days of limited computational power, and is not intrinsically scale invariant. The preferred method is to utilize a Maximum-likelihood method (the default method in R). Let’s now start with fitting the one factor model:
spearman.mle<-factanal(covmat=as.matrix(sub),factors=1,n.obs = 33) # one factor, need total # of observation
spearman.mle
##
## Call:
## factanal(factors = 1, covmat = as.matrix(sub), n.obs = 33)
##
## Uniquenesses:
## Classics French English Mathematics Discrimination
## 0.088 0.240 0.349 0.447 0.525
## Music
## 0.570
##
## Loadings:
## Factor1
## Classics 0.955
## French 0.872
## English 0.807
## Mathematics 0.744
## Discrimination 0.689
## Music 0.655
##
## Factor1
## SS loadings 3.781
## Proportion Var 0.630
##
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 2.54 on 9 degrees of freedom.
## The p-value is 0.98
Note: If your input data is the correlation matrix, the total # of observations needs to be specified in order to perform the above test.
1-spearman.mle$uniq # model 可以解釋的部分, uniq是海神叉的部分
## Classics French English Mathematics Discrimination
## 0.9124518 0.7603502 0.6513125 0.5531795 0.4746379
## Music
## 0.4295541
Notice that 63% of the total variation is explained by this single factor model (which is not bad at all). In particular, Classics and French are best explained (approximately 91% and 76% of the variation in the scores of these two subjects is explained by the model respectively), while Discrimination and Music are the least explained (47% and 43% might be a bit lower, but acceptable).
Let us consider the two factor model and see what happens:
spearman.mle2<-factanal(covmat=as.matrix(sub),factors=2,n.obs=33)
spearman.mle2
##
## Call:
## factanal(factors = 2, covmat = as.matrix(sub), n.obs = 33)
##
## Uniquenesses:
## Classics French English Mathematics Discrimination
## 0.094 0.239 0.343 0.426 0.005
## Music
## 0.558
##
## Loadings:
## Factor1 Factor2
## Classics 0.870 0.387
## French 0.768 0.414
## English 0.752 0.302
## Mathematics 0.727 0.215
## Discrimination 0.342 0.937
## Music 0.636 0.195
##
## Factor1 Factor2
## SS loadings 2.961 1.375
## Proportion Var 0.493 0.229
## Cumulative Var 0.493 0.723
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 1.1 on 4 degrees of freedom.
## The p-value is 0.894
1-spearman.mle2$uniq
## Classics French English Mathematics Discrimination
## 0.9062458 0.7609031 0.6571494 0.5742923 0.9950000
## Music
## 0.4421487
As can be seen, the explained proportion of the total variation is increased to 72%. This two factor model significantly increases the explained % of variation for the subject “Discrimination” (up to 99.5%), but for the subject “Music”, 44% is still not particularly satisfactory. However, there appears to be overlapping on the loadings of the obtained two factors — even when the cutoff point 0.4 is used. This makes the resulting two factors difficult to explain.
We plot the 2-D factor loadings of each variable:
subnames <- names(sub)
l<-loadings(spearman.mle2)
plot(l[,1],l[,2],type="n",xlab="Factor 1",ylab="Factor 2",xlim=c(0,1),ylim=c(0,1))
text(l[,1],l[,2],subnames)
abline(v=0.4,lty=2);abline(h=0.4,lty=2)
Q: How do you interpret the result? Remark: The default for the mle method is a “varimax” rotation, which is orthogonal.
Another possibility in R is “promax” rotation, which is oblique.
We next consider the two factor model with a non-orthogonal rotation “promax”:
spearman.mle3<-factanal(covmat=as.matrix(sub),factors=2,rotation="promax",n.obs=33)
spearman.mle3
##
## Call:
## factanal(factors = 2, covmat = as.matrix(sub), n.obs = 33, rotation = "promax")
##
## Uniquenesses:
## Classics French English Mathematics Discrimination
## 0.094 0.239 0.343 0.426 0.005
## Music
## 0.558
##
## Loadings:
## Factor1 Factor2
## Classics 0.915
## French 0.784 0.136
## English 0.803
## Mathematics 0.801
## Discrimination 0.939
## Music 0.699
##
## Factor1 Factor2
## SS loadings 3.235 0.913
## Proportion Var 0.539 0.152
## Cumulative Var 0.539 0.691
##
## Factor Correlations:
## Factor1 Factor2
## Factor1 1.0 0.6
## Factor2 0.6 1.0
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 1.1 on 4 degrees of freedom.
## The p-value is 0.894
Now about 69% of the total variation can be explained – a bit smaller than before. Let’s compute the communalities:
1-spearman.mle3$uniq
## Classics French English Mathematics Discrimination
## 0.9062458 0.7609031 0.6571494 0.5742923 0.9950000
## Music
## 0.4421487
As can be seen, thought we lose the property of orthogonality, the interpretation of this two factor model has become more interesting. Again, by using a common cutoff point 0.4, all the variables contribute to Factor 1, except for “Discrimination”. This can be checked by making the following plot:
l<-loadings(spearman.mle3)
plot(l[,1],l[,2],type="n",xlab="Factor 1",ylab="Factor 2")
text(l[,1],l[,2],subnames)
abline(v=0.4,lty=2);abline(h=0.4,lty=2)
Taking (0.4, 0.4) as the center, the variables were divided into two groups – one in the lower right and the other in the upper left. This gives a clear and perfect interpretation of the two factors, i.e. no overlapping of variables with loadings higher than 0.4. You can further check the fitting of the 3 factor model (where 3 is the maximum number of factors for fitting the model).
Remark: The degrees of freedom parameter is given by the expression
df = m(m+1)/2-[mk+m-k(k-1)/2], where m is the number of variables and k the number of factors.
In case the input x is a data matrix, you can request factor scores as well. For example, let us consider the “City Crime Data” and perform a 2D principal FA with the “Bartlett” score:
# anothor package sychou
setwd("/Users/kevinpiger/Desktop")
x<-read.table("citycrime.txt",h=T)
library(robCompositions)
## Loading required package: robustbase
## Loading required package: ggplot2
## Loading required package: data.table
## Loading required package: e1071
## Loading required package: pls
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
## sROC 0.1-2 loaded
##
## Attaching package: 'robCompositions'
## The following object is masked from 'package:robustbase':
##
## alcohol
factor <- pfa(x,factors=2,scores="Bartlett")
biplot(factor)