Factor analysis is one of the more popular data reduction methods. It is often used in marketing research to summarize the relationships among a set of variables.

In other words, factor analysis assesses the correlations between variables. It is most often used on large datasets with too many inputs, where we want to find out which inputs matter for our target variable, Y.

Create sample data

As a first step, let’s create an example dataset and take a first look at it using the summary function. Let’s say that these are all the product attributes tested in a market research survey and we want to find out which variables are important.

data <- data.frame(replicate(10, sample(1:5, 1000, replace=TRUE)))
names(data) <- c("Taste", "Value for money", "Color", "Size", "Volume", "Brand", 
                 "Quality", "Promotion", "Store Location", "Domestic")
summary(data)
##      Taste       Value for money     Color            Size      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :3.045   Mean   :3.109   Mean   :3.046   Mean   :2.909  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      Volume          Brand          Quality       Promotion    
##  Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.00   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.00   Median :3.000  
##  Mean   :2.919   Mean   :2.921   Mean   :2.99   Mean   :2.962  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
##  Store Location     Domestic    
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000  
##  Mean   :3.058   Mean   :2.955  
##  3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000
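
Because the answers are drawn at random, every run produces a slightly different dataset. A minimal sketch of how to make the draw repeatable, assuming you fix a seed first (the seed value 2024 and the names a and b are arbitrary choices for illustration, not part of the original example):

```r
# Fixing the seed makes the "random" survey answers repeatable
set.seed(2024)                                   # arbitrary seed, chosen for illustration
a <- data.frame(replicate(10, sample(1:5, 1000, replace = TRUE)))

set.seed(2024)                                   # same seed again
b <- data.frame(replicate(10, sample(1:5, 1000, replace = TRUE)))

identical(a, b)                                  # both draws contain the same values
dim(a)                                           # 1000 respondents, 10 attributes
```

With a different seed, or no seed at all, the summary statistics and correlations will differ a little from run to run.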

Step 1

Understanding data

Use summary to get an overview of each column in detail, then load the packages used in the rest of the analysis.

library(Hmisc) #exploratory data analysis
library(psych) #factor analysis, PCA, cluster and reliability analysis
library(nFactors) #determine no of factors 
library(ggrepel) #avoid overlapping text on scatterplot

Since our dataset is a data frame, we need to convert it to a matrix before running the correlation test. You may get different correlation scores because the dataset is generated from random numbers.

The p-values are shown after the correlation scores and indicate whether each correlation is statistically significant. For a 95% confidence level (a significance level of 0.05), a correlation is significant when its p-value is 0.05 or below. The smaller the p-value, the stronger the evidence that the correlation is not zero.

data.cor <- rcorr(as.matrix(data))
data.cor
##                 Taste Value for money Color  Size Volume Brand Quality
## Taste            1.00           -0.05  0.03 -0.02  -0.03 -0.03   -0.02
## Value for money -0.05            1.00  0.03 -0.01   0.06  0.00    0.01
## Color            0.03            0.03  1.00 -0.03   0.00  0.03   -0.01
## Size            -0.02           -0.01 -0.03  1.00   0.03  0.00   -0.07
## Volume          -0.03            0.06  0.00  0.03   1.00 -0.02    0.01
## Brand           -0.03            0.00  0.03  0.00  -0.02  1.00    0.04
## Quality         -0.02            0.01 -0.01 -0.07   0.01  0.04    1.00
## Promotion        0.00            0.01  0.01 -0.05  -0.03 -0.05   -0.01
## Store Location   0.01            0.01  0.01  0.04   0.04 -0.01    0.04
## Domestic         0.05            0.00  0.03  0.00   0.01  0.02    0.04
##                 Promotion Store Location Domestic
## Taste                0.00           0.01     0.05
## Value for money      0.01           0.01     0.00
## Color                0.01           0.01     0.03
## Size                -0.05           0.04     0.00
## Volume              -0.03           0.04     0.01
## Brand               -0.05          -0.01     0.02
## Quality             -0.01           0.04     0.04
## Promotion            1.00           0.02    -0.04
## Store Location       0.02           1.00    -0.03
## Domestic            -0.04          -0.03     1.00
## 
## n= 1000 
## 
## 
## P
##                 Taste  Value for money Color  Size   Volume Brand  Quality
## Taste                  0.0894          0.4086 0.5553 0.4027 0.3174 0.5944 
## Value for money 0.0894                 0.3342 0.7836 0.0607 0.8897 0.8225 
## Color           0.4086 0.3342                 0.2999 0.9666 0.4009 0.8232 
## Size            0.5553 0.7836          0.2999        0.4032 0.9459 0.0351 
## Volume          0.4027 0.0607          0.9666 0.4032        0.5488 0.7320 
## Brand           0.3174 0.8897          0.4009 0.9459 0.5488        0.1713 
## Quality         0.5944 0.8225          0.8232 0.0351 0.7320 0.1713        
## Promotion       0.8908 0.8560          0.8659 0.1123 0.3429 0.1439 0.7594 
## Store Location  0.6663 0.8113          0.7497 0.1739 0.2030 0.7878 0.2389 
## Domestic        0.1527 0.9988          0.3886 0.9608 0.7370 0.5844 0.1830 
##                 Promotion Store Location Domestic
## Taste           0.8908    0.6663         0.1527  
## Value for money 0.8560    0.8113         0.9988  
## Color           0.8659    0.7497         0.3886  
## Size            0.1123    0.1739         0.9608  
## Volume          0.3429    0.2030         0.7370  
## Brand           0.1439    0.7878         0.5844  
## Quality         0.7594    0.2389         0.1830  
## Promotion                 0.5919         0.2416  
## Store Location  0.5919                   0.3313  
## Domestic        0.2416    0.3313
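
Reading a 10 x 10 table of p-values by eye gets tedious, so it can help to list only the significant pairs. The sketch below does this in base R, assuming Pearson correlations and a 0.05 cutoff. It computes the p-values by hand from the t statistic, which should correspond to the P matrix that rcorr prints. The object names (sim, t_stat, sig) and the seed are made up for illustration.

```r
# Simulated survey answers with the same shape as the example data
set.seed(42)                                   # arbitrary seed
sim <- data.frame(replicate(10, sample(1:5, 1000, replace = TRUE)))

r <- cor(sim)                                  # Pearson correlation matrix
n <- nrow(sim)
t_stat <- r * sqrt((n - 2) / (1 - r^2))        # t statistic for each correlation
p <- 2 * pt(-abs(t_stat), df = n - 2)          # two-sided p-values
diag(p) <- NA                                  # a variable is not tested against itself

# Keep each pair once (upper triangle) and only if p < 0.05
sig <- which(p < 0.05 & upper.tri(p), arr.ind = TRUE)
data.frame(var1 = rownames(r)[sig[, "row"]],
           var2 = colnames(r)[sig[, "col"]],
           r    = r[sig],
           p    = p[sig])
```

With purely random answers, roughly 5% of the 45 pairs are expected to pass the 0.05 cutoff by chance alone, so a couple of "significant" pairs here do not mean the attributes are really related.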
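
Next, run a principal component analysis on the correlation matrix to see how much of the total variance each component explains.

data.pca <- princomp(data, cor=TRUE)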
summary(data.pca)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.0574265 1.0543955 1.0388577 1.0247028 1.0142505
## Proportion of Variance 0.1118151 0.1111750 0.1079225 0.1050016 0.1028704
## Cumulative Proportion  0.1118151 0.2229901 0.3309126 0.4359142 0.5387846
##                           Comp.6     Comp.7     Comp.8     Comp.9
## Standard deviation     1.0007753 0.96686361 0.95505903 0.95068124
## Proportion of Variance 0.1001551 0.09348252 0.09121377 0.09037948
## Cumulative Proportion  0.6389397 0.73242222 0.82363600 0.91401548
##                           Comp.10
## Standard deviation     0.92727840
## Proportion of Variance 0.08598452
## Cumulative Proportion  1.00000000
plot(data.pca, col="blue", main="Principal Components")
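
The "Proportion of Variance" row above comes straight from the eigenvalues of the correlation matrix. Here is a small base-R sketch of that link, together with the eigenvalue-greater-than-1 rule of thumb (the Kaiser rule) that is sometimes used to pick the number of factors. The Kaiser rule is an assumption added here for illustration and is not the method used in this walkthrough, and sim and ev are made-up names.

```r
# Simulated survey answers with the same shape as the example data
set.seed(42)                                   # arbitrary seed
sim <- data.frame(replicate(10, sample(1:5, 1000, replace = TRUE)))

ev  <- eigen(cor(sim))$values                  # eigenvalues of the correlation matrix
pca <- princomp(sim, cor = TRUE)

# The squared component standard deviations are the same eigenvalues
all.equal(unname(pca$sdev^2), ev)

prop_var <- ev / sum(ev)                       # proportion of variance per component
round(cumsum(prop_var), 3)                     # cumulative proportion, as in summary(pca)

sum(ev > 1)                                    # components kept by the Kaiser rule
```

For uncorrelated data like ours all eigenvalues sit close to 1, so the rule cannot separate important components from noise. That is another hint that this dataset has no real factor structure.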

Step 2

Use factanal and let’s consider 3 factors given the small dataset. Most of the attributes look unique, given their high uniqueness scores. The Domestic attribute loads on factor 1, and Size loads on factor 2 (with a smaller loading on factor 1). In other words, these are the factors driving how participants answer the questions on Domestic and Size. Loadings are simply the correlations between the observed variables and the unobserved factors.

This is not a surprising result: since most of the attributes are unique, we get only one attribute in each factor. In a real dataset we would generally get several attributes loading on each factor, which you can group together under a single variable name (e.g. factor1) and use for further analysis. This helps to reduce the number of variables used in your predictive modelling.

The high p-value means that we do not reject the null hypothesis that 3 factors are sufficient. However, given the uniqueness of the attributes and the lack of correlation, you may want to consider different modelling techniques. The 3 factors capture only 22% of the variance originally observed across the 10 variables.

factors.3 <- factanal(data, factors=3, rotation="varimax")
print(factors.3, digits=2 , cutoff=0.3, sort=TRUE)
## 
## Call:
## factanal(x = data, factors = 3, rotation = "varimax")
## 
## Uniquenesses:
##           Taste Value for money           Color            Size 
##            0.97            0.93            1.00            0.00 
##          Volume           Brand         Quality       Promotion 
##            0.95            1.00            0.99            0.99 
##  Store Location        Domestic 
##            0.99            0.00 
## 
## Loadings:
##                 Factor1 Factor2 Factor3
## Domestic         0.92                  
## Size             0.34    0.90          
## Taste                                  
## Value for money                        
## Color                                  
## Volume                                 
## Brand                                  
## Quality                                
## Promotion                              
## Store Location                         
## 
##                Factor1 Factor2 Factor3
## SS loadings       0.98    0.91    0.28
## Proportion Var    0.10    0.09    0.03
## Cumulative Var    0.10    0.19    0.22
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 12.29 on 18 degrees of freedom.
## The p-value is 0.832
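
To see what the output looks like when the data does have structure, here is a sketch on simulated data where two hidden factors really drive the answers. Everything in it is an assumption made for illustration: the attribute names a1 to b3, the weights 0.8 and 0.7, the noise levels and the seed. It also shows how the "SS loadings" and "Proportion Var" rows are derived from the loading matrix.

```r
# Simulate 6 attributes driven by 2 hypothetical latent factors
set.seed(123)                                  # arbitrary seed
n  <- 1000
f1 <- rnorm(n)                                 # hypothetical latent factor 1
f2 <- rnorm(n)                                 # hypothetical latent factor 2
sim <- data.frame(
  a1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  a2 = 0.8 * f1 + rnorm(n, sd = 0.6),
  a3 = 0.8 * f1 + rnorm(n, sd = 0.6),
  b1 = 0.7 * f2 + rnorm(n, sd = 0.7),
  b2 = 0.7 * f2 + rnorm(n, sd = 0.7),
  b3 = 0.7 * f2 + rnorm(n, sd = 0.7)
)

fit <- factanal(sim, factors = 2, rotation = "varimax")
L   <- unclass(fit$loadings)                   # plain loading matrix

ss_loadings <- colSums(L^2)                    # "SS loadings" row of the printout
prop_var    <- ss_loadings / ncol(sim)         # "Proportion Var" row
communality <- rowSums(L^2)                    # share of each attribute explained by the factors

round(prop_var, 2)
round(cbind(communality, uniqueness = fit$uniquenesses), 2)
```

Here the a attributes should group on one factor and the b attributes on the other, and each uniqueness is roughly 1 minus the communality. In the random survey data above almost nothing is shared between attributes, which is why most uniquenesses are close to 1.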

You can also use an xy plot to visualize the loading scores.

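# plot the loadings on factor 1 against the loadings on factor 2
load <- as.data.frame(factors.3$loadings[, 1:2])
load$attribute <- rownames(load)  # keep the attribute names as a column for labelling
# geom_point (not geom_jitter) so each label sits at the attribute's actual loading
ggplot(load, aes(Factor1, Factor2, label = attribute)) +
  geom_point(color = "red") +
  geom_label_repel(aes(fill = attribute), show.legend = FALSE)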

Step 3

If we accept the 3-factor solution, we can request regression scores as shown below. The output gives each observation's estimated score on each of the factors, on a standardized scale.

data.output <- factanal(data, factors=3, rotation="varimax", scores="regression")
head(data.output$scores)
##         Factor1    Factor2    Factor3
## [1,] -1.1372319 -0.9163236 -0.4806894
## [2,]  0.4021581 -0.6293783 -0.7702162
## [3,] -0.7670749  1.6677243  0.8542041
## [4,]  0.4250934  1.5607908 -0.2723919
## [5,]  1.9457862  0.7145322  0.6011656
## [6,]  1.4736076 -0.5764311  0.2873648
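
Once you have factor scores, you can use them in place of the original attributes in a predictive model. Below is a rough sketch of that idea on simulated data. The target variable y, the attribute names, the coefficients and the seed are all hypothetical, since the example dataset in this post has no target variable.

```r
# Simulate 6 attributes driven by 2 hypothetical latent factors
set.seed(123)                                  # arbitrary seed
n  <- 1000
f1 <- rnorm(n)
f2 <- rnorm(n)
sim <- data.frame(
  a1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  a2 = 0.8 * f1 + rnorm(n, sd = 0.6),
  a3 = 0.8 * f1 + rnorm(n, sd = 0.6),
  b1 = 0.7 * f2 + rnorm(n, sd = 0.7),
  b2 = 0.7 * f2 + rnorm(n, sd = 0.7),
  b3 = 0.7 * f2 + rnorm(n, sd = 0.7)
)
y <- 2 * f1 - f2 + rnorm(n)                    # hypothetical target variable Y

# Estimate the factors and keep one score per respondent and factor
fit    <- factanal(sim, factors = 2, rotation = "varimax", scores = "regression")
scores <- as.data.frame(fit$scores)            # columns Factor1 and Factor2
scores$y <- y

# Regress the target on 2 factor scores instead of 6 attributes
model <- lm(y ~ Factor1 + Factor2, data = scores)
summary(model)$r.squared
```

The model uses two predictors instead of six, which is the kind of reduction factor analysis is meant to deliver when the attributes are genuinely correlated.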