Factor Analysis

Introduction

There are some areas in which we cannot measure directly the concepts of primary interest such as intelligence or social class which are called “latent variables”. In such cases, we collect information that can be measured and observed, and assume that these variables might be the best indicator of the primary interest. These variables are called “manifest variables”.

For example, people may respond similarly to questions about income, education, and occupation, which are all associated with the latent variable socioeconomic status.

Factor analysis most generally is used to help uncover the relationships between the latent variables and the manifest variables.

Objective:

Perform an exploratory factor analysis on the Workforce data set to investigate the relationship between manifest variables and latent variables(factors).

Data:

Work Force Data Set:

This data set showing percentages of labor force in nine different types of industry for 30 European countries. The data is grouped by the political entities to which the countries belong, e.g. EU, EFTA, and also by the year in which the data is taken. Variables: AGR - agriculture, MIN - mining, MAN - manufacturing, PS - power, CON - construction, SER - services, SPS - social services, TC - transport.

Method:

Explore the Dataset and find Correlation Matrix.
Choose the number of factors which will be used to describe manifest variables.
Estimation methods:
1. Principal components method
2. Principal factor model
3. Maximum likelihood estimation
Interpret the findings.

DATA ANALYSIS

library(corrplot)

## corrplot 0.90 loaded

library(psych, warn.conflicts = FALSE, quietly=TRUE)
library(nFactors)

## Warning: package 'nFactors' was built under R version 4.0.5

## Loading required package: lattice

## 
## Attaching package: 'nFactors'

## The following object is masked from 'package:lattice':
## 
##     parallel

df <- read.table("workforce.txt", header = T)

Exploring Data set

head(df)

##   Country Group  AGR MIN  MAN  PS CON  SER  FIN  SPS  TC YEAR
## 1 Belgium    EU  2.6 0.2 20.8 0.8 6.3 16.9  8.7 36.9 6.8 1990
## 2 Denmark    EU  5.6 0.1 20.4 0.7 6.4 14.5  9.1 36.3 7.0 1991
## 3  France    EU  5.1 0.3 20.2 0.9 7.1 16.7 10.2 33.1 6.4 1992
## 4 Germany    EU  3.2 0.7 24.8 1.0 9.4 17.2  9.6 28.4 5.6 1995
## 5  Greece    EU 22.2 0.5 19.2 1.0 6.8 18.2  5.3 19.8 6.9 1991
## 6 Ireland    EU 13.8 0.6 19.8 1.2 7.1 17.8  8.4 25.5 5.8 1991

nrow(df)

## [1] 30

colnames(df)

##  [1] "Country" "Group"   "AGR"     "MIN"     "MAN"     "PS"      "CON"    
##  [8] "SER"     "FIN"     "SPS"     "TC"      "YEAR"

unique(df$Group)

## [1] "EU"      "EFTA"    "Eastern" "Other"

There are total of 30 observations of 11 variables.

Note that we can only apply factor analysis on certain groups. Therefore, either we need to choose a certain group among those 4 groups,or we can choose the whole data set, just dropping “Group”" column. We will also be dropping the “YEAR” column.

newdf <- df[, 3:11]
head(newdf)

##    AGR MIN  MAN  PS CON  SER  FIN  SPS  TC
## 1  2.6 0.2 20.8 0.8 6.3 16.9  8.7 36.9 6.8
## 2  5.6 0.1 20.4 0.7 6.4 14.5  9.1 36.3 7.0
## 3  5.1 0.3 20.2 0.9 7.1 16.7 10.2 33.1 6.4
## 4  3.2 0.7 24.8 1.0 9.4 17.2  9.6 28.4 5.6
## 5 22.2 0.5 19.2 1.0 6.8 18.2  5.3 19.8 6.9
## 6 13.8 0.6 19.8 1.2 7.1 17.8  8.4 25.5 5.8

Principal Component Method (PCM)

There are several methods of estimating the factor loadings and commonalities, including principal component method, and principal factor method and maximum likelihood estimation. The principal component method is the most common approaches for the estimation.

We first find the correlation matrix from a sample data, and then find the estimator using Eigen decomposition.

Correlation Matrix

C <- cor(newdf)

corrplot((C) , method = "number")

Eigen Values

eig <- eigen(C)
round(eig$values, 3)

## [1] 3.112 1.809 1.496 1.063 0.710 0.311 0.293 0.204 0.000

The common rule of thumb is to select components whose Eigenvalue is at least 1. So, applying this rule, I will choose the first 4 components.

plot(eig$values, xlab="Index", ylab="Eigenvalues")
abline(h=1, col="red")

ap <- parallel(subject=nrow(newdf),var=ncol(newdf), rep=100,cent=.05)
nS <- nScree(eig$values, ap$eigen$qevpea)
plotnScree(nS)

Once again, the first first 4 Eigenvalues whose values over 1 called “strong factor” and the others called “weak factor”, so we will be dropping the weak factors(components).

colnames(df[,3:11])

## [1] "AGR" "MIN" "MAN" "PS"  "CON" "SER" "FIN" "SPS" "TC"

round(eig$values/sum(eig$values), 3)

## [1] 0.346 0.201 0.166 0.118 0.079 0.035 0.033 0.023 0.000

round(cumsum(eig$values/sum(eig$values)), 3)

## [1] 0.346 0.547 0.713 0.831 0.910 0.945 0.977 1.000 1.000

We can see that the first 4 Eigenvalues account for 83% of the original variables.so, we can use these for components with little loss information.

We can find loadings and interpret them for factor analysis.

Loadings

EV <- eig$vectors[, 1:4]
Diag <- diag(sqrt(eig$values[ 1:4]))

loadings <- EV %*% Diag
round(loadings, 2)

##        [,1]  [,2]  [,3]  [,4]
##  [1,] -0.90  0.03 -0.34  0.02
##  [2,] -0.66  0.00  0.63  0.12
##  [3,]  0.43 -0.58 -0.61  0.06
##  [4,]  0.56 -0.15 -0.36  0.02
##  [5,]  0.39  0.33  0.09  0.81
##  [6,]  0.67  0.55  0.08  0.17
##  [7,]  0.23  0.74 -0.12 -0.50
##  [8,]  0.76 -0.07  0.44 -0.33
##  [9,]  0.36 -0.69  0.51 -0.04

colnames(df[,3:11])

## [1] "AGR" "MIN" "MAN" "PS"  "CON" "SER" "FIN" "SPS" "TC"

Interpretation:

For the 1st factor, we can see that the countries that have advanced in agriculture and mining, are more backward in Service and Social services.

It is difficult to interpret to 2nd factor, so that we may apply rotation method later.

We can also look at the row-wise. For example, agriculture is mostly explained by the first factor and mining is explained first and third factor. There is no variable that is explained by only one factor.

Commonalities

commonalities <- apply(loadings^2, 1, sum)
round(commonalities, 2)

## [1] 0.93 0.85 0.91 0.46 0.92 0.79 0.87 0.88 0.87

Commonality is the proportion of each variable’s variance that can be explained by the factor analysis.

In this example, Commonalities show that most of the variables largely explained by the 4 factors, except for PS (power). For example, 93% of the variance of Agriculture and and 85% of the variance of Mining is explained by first 4 factors, and so on.

Principal Factors Method (PFM)

Rinv <- solve(C); #Calculate the inverse of Correlation matrix

Psi <- diag(1/diag(Rinv)) # Replace the diagonal of R with estimated commonalities

RminusPsi <- C - Psi; # difference between Correlation matrix and Psi

# The factor loadings are calculated by finding Eigenvalues and Eigenvector

eig2 <- eigen(RminusPsi) # find Eigenvalues of the RminusPsi matrix

EV2 <- eig2$vectors[, 1:4]
Diag2 <- diag(sqrt(eig2$values[ 1:4]))

loadings2 <- EV2 %*% Diag2

round(loadings2,2)

##        [,1]  [,2]  [,3]  [,4]
##  [1,] -0.90  0.03 -0.34  0.02
##  [2,] -0.66  0.00  0.63  0.12
##  [3,]  0.43 -0.58 -0.61  0.06
##  [4,]  0.56 -0.15 -0.36  0.02
##  [5,]  0.39  0.33  0.09  0.81
##  [6,]  0.67  0.55  0.08  0.17
##  [7,]  0.23  0.74 -0.12 -0.50
##  [8,]  0.76 -0.07  0.44 -0.33
##  [9,]  0.36 -0.69  0.50 -0.04

Interpretation:

Loadings are identical to what we get from the Principal Component method. So, we can do a similar interpretation.

commonalities2 <- apply(loadings2^2, 1, sum)
round(commonalities2,2)

## [1] 0.93 0.85 0.91 0.46 0.92 0.79 0.87 0.88 0.87

Commonalities are also same.

Maximum Likelihood Method

We assume that X ~ MVN(0,LL^T,+Psi) where Psi= diag(R − LL^T) and need to find (L,Psi) that maximizes likelihood.

Note that;

• Typically the elements of (R − LL^T) will be smaller for the maximum likelihood estimate than those from the PCM or PFM.
• However, the cumulative proportion of the total variation explained by the factors will be larger for the PCM and PFM.

factanal(as.matrix(newdf), 4, rotation = "none")

## 
## Call:
## factanal(x = as.matrix(newdf), factors = 4, rotation = "none")
## 
## Uniquenesses:
##   AGR   MIN   MAN    PS   CON   SER   FIN   SPS    TC 
## 0.005 0.005 0.005 0.744 0.601 0.173 0.505 0.005 0.468 
## 
## Loadings:
##     Factor1 Factor2 Factor3 Factor4
## AGR -0.833  -0.443  -0.213  -0.244 
## MIN -0.748   0.514           0.404 
## MAN  0.611  -0.704  -0.117   0.336 
## PS   0.463  -0.128   0.128         
## CON  0.192   0.122   0.589         
## SER  0.493   0.242   0.682  -0.245 
## FIN  0.133   0.187   0.443  -0.496 
## SPS  0.770   0.572  -0.237  -0.136 
## TC   0.398   0.264  -0.330   0.442 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      2.890   1.477   1.257   0.865
## Proportion Var   0.321   0.164   0.140   0.096
## Cumulative Var   0.321   0.485   0.625   0.721
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 181.1 on 6 degrees of freedom.
## The p-value is 1.98e-36

Factor Rotation

We know how to choose the number of factors, and now we should know how to rotate the factors.

Factor rotation minimize the complexity of the factor loadings to make the structure simpler to interpret. There are 2 type of rotation:

Orthogonal rotations It assumes that the factors are uncorrelated, then Orthogonal rotation may be applied.

If you wish to begin with the assumption of uncorrelated (independent) factors, you have the following orthogonal rotation options:

Varimax rotation is optimized to reduce cross loadings and to minimize smaller loading values, making factor models clearer.
Quartimax rotation works to reduce the number of variables needed to explain a factor, making interpretation easier.
Equamax rotation offers a compromise between varimax and quartimax.

Oblique rotations It assumes that the factors are correlated, then Oblique rotation may be applied.

If you wish to begin with the assumption of correlated (non-independent) factors, you have two oblique rotation option:

Promax rotation is popular for its ability to handle large datasets efficiently. The approach also tends to result in greater correlation values between factors.
Oblimin rotation approach is somewhat less efficient with large datasets, but can produce a simpler factor structure.

Estimation Using Principal Components

Nfacs <- 4  # number of factors
fit1 <- principal(newdf, Nfacs, rotate="varimax")
fit1$loadings # print results

## 
## Loadings:
##     RC1    RC3    RC2    RC4   
## AGR -0.851 -0.266        -0.356
## MIN -0.108 -0.860 -0.295       
## MAN         0.892 -0.319       
## PS   0.192  0.637         0.137
## CON                       0.954
## SER  0.349  0.153  0.479  0.645
## FIN                0.932       
## SPS  0.911  0.124  0.173       
## TC   0.726        -0.568 -0.142
## 
##                  RC1   RC3   RC2   RC4
## SS loadings    2.260 2.052 1.657 1.512
## Proportion Var 0.251 0.228 0.184 0.168
## Cumulative Var 0.251 0.479 0.663 0.831

Principal Factor Analysis

Nfacs <- 4  # number of factors
fit2 <- factor.pa(newdf, Nfacs, rotate="varimax")

## Warning: factor.pa is deprecated. Please use the fa function with fm=pa

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.

fit2$loadings  # print results

## 
## Loadings:
##     PA1    PA3    PA2    PA4   
## AGR -0.893  0.266        -0.338
## MIN -0.126  0.816 -0.302       
## MAN        -0.949 -0.342       
## PS   0.210 -0.421         0.120
## CON                       0.910
## SER  0.347 -0.172  0.459  0.511
## FIN                0.806       
## SPS  0.848 -0.132  0.149       
## TC   0.642        -0.545 -0.113
## 
##                  PA1   PA3   PA2   PA4
## SS loadings    2.119 1.866 1.400 1.247
## Proportion Var 0.235 0.207 0.156 0.139
## Cumulative Var 0.235 0.443 0.598 0.737

Estimation Using Maximum Likelihood

Nfacs <- 4  # number of factors

fit3<- factanal(as.matrix(newdf), 4, rotation = "varimax")

fit3$loadings

## 
## Loadings:
##     Factor1 Factor2 Factor3 Factor4
## AGR -0.807  -0.248  -0.528         
## MIN         -0.871          -0.466 
## MAN          0.942          -0.313 
## PS   0.202   0.411   0.213         
## CON                  0.628         
## SER  0.218   0.149   0.782   0.382 
## FIN                  0.430   0.553 
## SPS  0.947                   0.288 
## TC   0.613          -0.101  -0.370 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      2.025   1.923   1.546   0.995
## Proportion Var   0.225   0.214   0.172   0.111
## Cumulative Var   0.225   0.439   0.610   0.721

Our goal is to create a simpler model, so I choose the varimax rotation method.

You can notice that Principal Component analysis method explains the larger variance (83.1%) with 4 factors compared to other methods (73.7% and 72.1%). But, Maximum likelihood method produce simpler factor structure.

**********************

Factor Analysis

Mustafa Arslan

8/28/2021

DATA ANALYSIS