Factor Analysis

Zoe Zhu

data = read.csv('toothpaste.csv')

Explore Data

The survey consists of responses to six seven-point likert-scale items with anchors, strongly disagree and strongly agree.
* prevents_cavities: It is important to buy a toothpaste that prevents cavities
* shiny_teeth: I like a toothpaste that gives shiny teeth
* strengthens_gums: A toothpaste should strengthen your gums
* freshens_breath: I prefer a toothpaste that freshens breath
* prevent_decay_not_imp: Prevention of tooth decay is not an important benefit offered by a toothpaste
* attractive_teeth: The most important consideration in buying a toothpaste is attractive teeth

survey = data[,2:7]
head(survey)
##   prevents_cavities shiny_teeth strengthens_gums freshens_breath
## 1                 7           3                6               4
## 2                 1           3                2               4
## 3                 6           2                7               4
## 4                 4           5                4               6
## 5                 1           2                2               3
## 6                 6           3                6               4
##   prevent_decay_not_imp attractive_teeth
## 1                     2                4
## 2                     5                4
## 3                     1                3
## 4                     2                5
## 5                     6                2
## 6                     2                4
summary(survey)
##  prevents_cavities  shiny_teeth  strengthens_gums freshens_breath
##  Min.   :1.000     Min.   :2.0   Min.   :1.0      Min.   :2.0    
##  1st Qu.:2.000     1st Qu.:3.0   1st Qu.:2.0      1st Qu.:3.0    
##  Median :4.000     Median :4.0   Median :4.0      Median :4.0    
##  Mean   :3.933     Mean   :3.9   Mean   :4.1      Mean   :4.1    
##  3rd Qu.:6.000     3rd Qu.:5.0   3rd Qu.:6.0      3rd Qu.:5.0    
##  Max.   :7.000     Max.   :7.0   Max.   :7.0      Max.   :7.0    
##  prevent_decay_not_imp attractive_teeth
##  Min.   :1.0           Min.   :2.000   
##  1st Qu.:2.0           1st Qu.:3.000   
##  Median :3.5           Median :4.000   
##  Mean   :3.5           Mean   :4.167   
##  3rd Qu.:5.0           3rd Qu.:4.750   
##  Max.   :7.0           Max.   :7.000

Suitability for Factor Analysis

Correlations

We examine bivariate correlations among responses to the six survey questions.

  • A correlation matrix with some large and some small correlations is ideal for factor analysis.
  • If all correlations are small, there is no way to group variables. On the other hand, if all correlations are large, then all the variables will load onto the same factor.

It is common to assume responses to itemized rating scales, the kind used for this survey, are on an interval scale. Therefore, we will compute pearson’s product moment correlation. Note, if we treat the data as being ordinal (as some researchers would argue), we would examine polychoric correlations.

round(cor(survey), 3)
##                       prevents_cavities shiny_teeth strengthens_gums
## prevents_cavities                 1.000      -0.053            0.873
## shiny_teeth                      -0.053       1.000           -0.155
## strengthens_gums                  0.873      -0.155            1.000
## freshens_breath                  -0.086       0.572           -0.248
## prevent_decay_not_imp            -0.858       0.020           -0.778
## attractive_teeth                  0.004       0.640           -0.018
##                       freshens_breath prevent_decay_not_imp
## prevents_cavities              -0.086                -0.858
## shiny_teeth                     0.572                 0.020
## strengthens_gums               -0.248                -0.778
## freshens_breath                 1.000                -0.007
## prevent_decay_not_imp          -0.007                 1.000
## attractive_teeth                0.640                -0.136
##                       attractive_teeth
## prevents_cavities                0.004
## shiny_teeth                      0.640
## strengthens_gums                -0.018
## freshens_breath                  0.640
## prevent_decay_not_imp           -0.136
## attractive_teeth                 1.000

Since we are only interested in the pattern of correlations, we can look at a heatmap of correlations.

library(corrplot)
corrplot(cor(data[,2:7]),type = 'lower',,col = c('red','white','green'),method = 'square',diag = F)

Here is a plot using ggcorrplot

library(ggcorrplot)
ggcorrplot(cor(data[,2:7]),colors = c('red','white','green'),type = 'lower')

Bartlett’s Test of Sphericity

Looks to see if there are at least some non-zero correlations by comparing correlation matrix to an identity matrix. A significant test indicates suitability for factor analysis.

library(psych)
cortest.bartlett(cor(survey),n = 30)
## $chisq
## [1] 111.3138
## 
## $p.value
## [1] 9.017094e-17
## 
## $df
## [1] 15

KMO Measure of Sampling Adequacy (MSA)

Compares partial correlation matrix to pairwise correlation matrix. A partial correlation is a correlation after partialing out all other correlations. If the variables are strongly related, partial correlations should be small and MSA close to 1. If MSA > 0.5, data is suitable for factor analysis.

KMO(r = cor(survey))
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(survey))
## Overall MSA =  0.66
## MSA for each item = 
##     prevents_cavities           shiny_teeth      strengthens_gums 
##                  0.62                  0.70                  0.68 
##       freshens_breath prevent_decay_not_imp      attractive_teeth 
##                  0.64                  0.77                  0.56

Determine Number of Factors

It is critical to have an a priori expectation of the number of factors.

Now, let us examine some data-driven methods used to corroborate an priori solution or select from a set of a priori solutions.

Scree Plot

Line graph of eigen values for each factor. Ideal number of factors is indicated by a sudden change in the line graph or what is known as the elbow.

scree(cor(survey),factors = T,pc = F)

Eigen Value

According to the eigen-value criterion, all factors with eigen value greater than 1 are selected.

data.frame(factor = 1:ncol(survey), eigen = scree(cor(survey),factors = T,pc = F)$fv)

##   factor       eigen
## 1      1  2.53970272
## 2      2  1.27058351
## 3      3  0.01617005
## 4      4 -0.05397856
## 5      5 -0.55820744
## 6      6 -0.67456755

Parallel Analysis

Simulate a dataset with same variables and observations as original dataset. Compute correlation matrix and eigen values. Now, compare eigen values from simulated data to original data. Select factors with eigen values in the original data greater than eigen values in the simulated data.

fa.parallel(survey,fa='fa',fm = 'pa')

## Parallel analysis suggests that the number of factors =  2  and the number of components =  NA

Total Variance Explained

To ensure that the factors represents the original variables sufficiently well, the total variance explained by factors should be greater than 70%.

Since the three above tests corroborated the a priori two-factor solution, we will now run an exploratory factor analysis using principal axis factoring with two factors. Next, we examine the Cumulative Variance explained by the two factors.

result = fa(r = survey,nfactors = 2,fm = 'pa',rotate = 'none')
result$Vaccounted
##                             PA1       PA2
## SS loadings           2.5700755 1.8667388
## Proportion Var        0.4283459 0.3111231
## Cumulative Var        0.4283459 0.7394691
## Proportion Explained  0.5792615 0.4207385
## Cumulative Proportion 0.5792615 1.0000000

Extracted Communalities

  • Communality reflects the amount of variance in a variable that can be explained by the factors.
  • Larger the communality, the more of the variable is captured by the factor solution.
  • On the other hand, a small communality implies most of the variance in the variable was not captured. Ideally, communality of each variable must be greater than 0.7, but a communality greater than 0.5 may be seen as acceptable.
data.frame(communality = result$communality)
##                       communality
## prevents_cavities       0.9266887
## shiny_teeth             0.5625047
## strengthens_gums        0.8370333
## freshens_breath         0.6016147
## prevent_decay_not_imp   0.7898264
## attractive_teeth        0.7191465

Mapping Variables to Factors

Each variable is represented as a linear combination of factors. An ideal factor solution is where each variable is expected to load on (i.e., related to) only one factor. Such a result is easy to interpret. In practice, each variable may load on many factors. This may still be acceptable so long as the loading on one factor is large and on all other factors is small.

When the pattern of loadings does not show a clear preference of a variable for a factor, rotating the axes may help generate a clear mapping. There are two broad types of axes rotation
* Orthogonal: Axes are rotated while constraining them to be at right angles. E.g., varimax, quartimax, equimax
* Oblique: Axes are allowed to have any angle between them. E.g., oblimin, promax

Here is the pattern of loadings for the unrotated solution (i.e., rotation=’none)

print(result$loadings, cut=0)
## 
## Loadings:
##                       PA1    PA2   
## prevents_cavities      0.948  0.168
## shiny_teeth           -0.207  0.721
## strengthens_gums       0.914  0.039
## freshens_breath       -0.247  0.735
## prevent_decay_not_imp -0.850 -0.260
## attractive_teeth      -0.101  0.842
## 
##                  PA1   PA2
## SS loadings    2.570 1.867
## Proportion Var 0.428 0.311
## Cumulative Var 0.428 0.739

To make the matrix of loadings easier to interpret, let us exclude small loadings, say below 0.15. Note, four of the six variables load on both factors.

print(result$loadings, cut=0.15)
## 
## Loadings:
##                       PA1    PA2   
## prevents_cavities      0.948  0.168
## shiny_teeth           -0.207  0.721
## strengthens_gums       0.914       
## freshens_breath       -0.247  0.735
## prevent_decay_not_imp -0.850 -0.260
## attractive_teeth              0.842
## 
##                  PA1   PA2
## SS loadings    2.570 1.867
## Proportion Var 0.428 0.311
## Cumulative Var 0.428 0.739

Now, let’s examine the matrix after an orthogonal rotation using varimax. Is the mapping of variables to factors more clear?

fa_varimax = fa(r = survey,nfactors = 2,fm = 'pa',rotate = 'varimax')
print(fa_varimax$loadings,cut=0.15)
## 
## Loadings:
##                       PA1    PA2   
## prevents_cavities      0.962       
## shiny_teeth                   0.748
## strengthens_gums       0.902 -0.155
## freshens_breath               0.771
## prevent_decay_not_imp -0.886       
## attractive_teeth              0.844
## 
##                  PA1   PA2
## SS loadings    2.539 1.898
## Proportion Var 0.423 0.316
## Cumulative Var 0.423 0.739

Now, let’s try an oblique rotation using oblimin. What do you think of the mapping of variables to factors?

fa_oblimin = fa(r = survey,nfactors = 2,fm = 'pa',rotate = 'oblimin')
print(fa_oblimin$loadings,cut=0.15)
## 
## Loadings:
##                       PA1    PA2   
## prevents_cavities      0.948  0.168
## shiny_teeth           -0.207  0.721
## strengthens_gums       0.914       
## freshens_breath       -0.247  0.735
## prevent_decay_not_imp -0.850 -0.260
## attractive_teeth              0.842
## 
##                  PA1   PA2
## SS loadings    2.570 1.867
## Proportion Var 0.428 0.311
## Cumulative Var 0.428 0.739

Interpretation

Review pattern of factor loadings from the rotated matrix to interpret the factor analysis solution. The meaning of each factor is derived from the variables loading on it. So, let’s review the variables to describe each factor.

Since oblimin offered the clearest mapping of variables to factors, that matrix is used. To make interpretation easier, matrix is sorted.

What need is reflected in the first factor, PA1? What need is reflected by the second factor, PA2? Are these consistent with our a priori expectation?

print(fa_oblimin$loadings,cut=0.15, sort=T)
## 
## Loadings:
##                       PA1    PA2   
## prevents_cavities      0.948  0.168
## strengthens_gums       0.914       
## prevent_decay_not_imp -0.850 -0.260
## shiny_teeth           -0.207  0.721
## freshens_breath       -0.247  0.735
## attractive_teeth              0.842
## 
##                  PA1   PA2
## SS loadings    2.570 1.867
## Proportion Var 0.428 0.311
## Cumulative Var 0.428 0.739
fa.diagram(fa_oblimin,sort = T)

Representing the Factor

If the goal is to use the factors in further analysis, then they may be represented in one of three ways

Average scores of variables reflecting the factor

factor1_avg = rowMeans(survey[,c('prevents_cavities','strengthens_gums','prevent_decay_not_imp')])
factor2_avg = rowMeans(survey[,c('attractive_teeth','freshens_breath','shiny_teeth')]) 

Weighted average of variables reflecting the factor, where weights are the factor loadings

factor1_score = fa_oblimin$scores[,'PA1']
factor2_score = fa_oblimin$scores[,'PA2']

Pick a variable as a representative of the factor. Here, we are selecting the variable with the largest factor loading.

factor1_surrogate = survey[,'prevents_cavities']
factor2_surrogate = survey[,'attractive_teeth']