Factor Analysis applied in Multiple Linear Regression Context using R

Questions:

1- Is it possible that we can run the regression analysis on factors generating from Factor Analysis?
2- If so, how can we run it?
3- What is difference between FA and PCA?

The demonstration is to use the dataset Factor-Hair-Revised.csv to build a regression model to predict satisfaction.

Data Exploration

prodsurvey <- read_csv("Factor-Hair-Revised.csv")

## Parsed with column specification:
## cols(
##   ID = col_double(),
##   ProdQual = col_double(),
##   Ecom = col_double(),
##   TechSup = col_double(),
##   CompRes = col_double(),
##   Advertising = col_double(),
##   ProdLine = col_double(),
##   SalesFImage = col_double(),
##   ComPricing = col_double(),
##   WartyClaim = col_double(),
##   OrdBilling = col_double(),
##   DelSpeed = col_double(),
##   Satisfaction = col_double()
## )

head(prodsurvey)

## # A tibble: 6 x 13
##      ID ProdQual  Ecom TechSup CompRes Advertising ProdLine SalesFImage
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>       <dbl>    <dbl>       <dbl>
## 1     1      8.5   3.9     2.5     5.9         4.8      4.9         6  
## 2     2      8.2   2.7     5.1     7.2         3.4      7.9         3.1
## 3     3      9.2   3.4     5.6     5.6         5.4      7.4         5.8
## 4     4      6.4   3.3     7       3.7         4.7      4.7         4.5
## 5     5      9     3.4     5.2     4.6         2.2      6           4.5
## 6     6      6.5   2.8     3.1     4.1         4        4.3         3.7
## # ... with 5 more variables: ComPricing <dbl>, WartyClaim <dbl>,
## #   OrdBilling <dbl>, DelSpeed <dbl>, Satisfaction <dbl>

dim(prodsurvey)

## [1] 100  13

str(prodsurvey)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 100 obs. of  13 variables:
##  $ ID          : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ ProdQual    : num  8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
##  $ Ecom        : num  3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
##  $ TechSup     : num  2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
##  $ CompRes     : num  5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
##  $ Advertising : num  4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
##  $ ProdLine    : num  4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
##  $ SalesFImage : num  6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
##  $ ComPricing  : num  6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
##  $ WartyClaim  : num  4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
##  $ OrdBilling  : num  5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
##  $ DelSpeed    : num  3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
##  $ Satisfaction: num  8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ID = col_double(),
##   ..   ProdQual = col_double(),
##   ..   Ecom = col_double(),
##   ..   TechSup = col_double(),
##   ..   CompRes = col_double(),
##   ..   Advertising = col_double(),
##   ..   ProdLine = col_double(),
##   ..   SalesFImage = col_double(),
##   ..   ComPricing = col_double(),
##   ..   WartyClaim = col_double(),
##   ..   OrdBilling = col_double(),
##   ..   DelSpeed = col_double(),
##   ..   Satisfaction = col_double()
##   .. )

names(prodsurvey)

##  [1] "ID"           "ProdQual"     "Ecom"         "TechSup"      "CompRes"     
##  [6] "Advertising"  "ProdLine"     "SalesFImage"  "ComPricing"   "WartyClaim"  
## [11] "OrdBilling"   "DelSpeed"     "Satisfaction"

describe(prodsurvey)

##              vars   n  mean    sd median trimmed   mad min   max range  skew
## ID              1 100 50.50 29.01  50.50   50.50 37.06 1.0 100.0  99.0  0.00
## ProdQual        2 100  7.81  1.40   8.00    7.85  1.78 5.0  10.0   5.0 -0.24
## Ecom            3 100  3.67  0.70   3.60    3.63  0.52 2.2   5.7   3.5  0.64
## TechSup         4 100  5.37  1.53   5.40    5.40  1.85 1.3   8.5   7.2 -0.20
## CompRes         5 100  5.44  1.21   5.45    5.46  1.26 2.6   7.8   5.2 -0.13
## Advertising     6 100  4.01  1.13   4.00    4.00  1.19 1.9   6.5   4.6  0.04
## ProdLine        7 100  5.80  1.32   5.75    5.81  1.56 2.3   8.4   6.1 -0.09
## SalesFImage     8 100  5.12  1.07   4.90    5.09  0.89 2.9   8.2   5.3  0.37
## ComPricing      9 100  6.97  1.55   7.10    7.01  1.93 3.7   9.9   6.2 -0.23
## WartyClaim     10 100  6.04  0.82   6.10    6.04  0.89 4.1   8.1   4.0  0.01
## OrdBilling     11 100  4.28  0.93   4.40    4.31  0.74 2.0   6.7   4.7 -0.32
## DelSpeed       12 100  3.89  0.73   3.90    3.92  0.74 1.6   5.5   3.9 -0.45
## Satisfaction   13 100  6.92  1.19   7.05    6.90  1.33 4.7   9.9   5.2  0.08
##              kurtosis   se
## ID              -1.24 2.90
## ProdQual        -1.17 0.14
## Ecom             0.57 0.07
## TechSup         -0.63 0.15
## CompRes         -0.66 0.12
## Advertising     -0.94 0.11
## ProdLine        -0.60 0.13
## SalesFImage      0.26 0.11
## ComPricing      -0.96 0.15
## WartyClaim      -0.53 0.08
## OrdBilling       0.11 0.09
## DelSpeed         0.09 0.07
## Satisfaction    -0.86 0.12

datamatrix1 <- cor(prodsurvey[,-1])
corrplot(datamatrix1, method="number")

As we can see from the above correlation matrix: 1. Complaint resolution (CompRes) and delivery speed (DelSpeed) are highly correlated 2. Order billing (OrdBilling) and CompRes are highly correlated 3. WartyClaim and TechSupport are highly correlated 4. OrdBilling and DelSpeed are highly correlated 5. e-commerce (Ecom) and sales force image (SalesFImage) are highly correlated

pcor(prodsurvey[,-1], method="pearson")

## $estimate
##                  ProdQual          Ecom      TechSup      CompRes   Advertising
## ProdQual      1.000000000  0.1549597037  0.002424222 -0.056354142  0.1123767461
## Ecom          0.154959704  1.0000000000  0.082359000 -0.033513777 -0.0002972504
## TechSup       0.002424222  0.0823590001  1.000000000  0.143603415 -0.0593002540
## CompRes      -0.056354142 -0.0335137768  0.143603415  1.000000000 -0.0648060931
## Advertising   0.112376746 -0.0002972504 -0.059300254 -0.064806093  1.0000000000
## ProdLine      0.281144724  0.1538660545 -0.125349050  0.020305650 -0.1319231866
## SalesFImage  -0.376228551  0.7321011890 -0.093310796 -0.022150933  0.2628793429
## ComPricing   -0.014021386  0.0149131857 -0.132972485 -0.004487151 -0.0632980381
## WartyClaim   -0.042752416 -0.1232553822  0.787729606 -0.109686211  0.0275909587
## OrdBilling    0.054523215  0.1546990521 -0.165914724  0.288344382 -0.0326580430
## DelSpeed     -0.335084210 -0.0083917930 -0.021914916  0.528932328  0.2046544064
## Satisfaction  0.607438787 -0.3308440834  0.055111867  0.172416347 -0.0449735411
##                 ProdLine  SalesFImage   ComPricing  WartyClaim  OrdBilling
## ProdQual      0.28114472 -0.376228551 -0.014021386 -0.04275242  0.05452322
## Ecom          0.15386605  0.732101189  0.014913186 -0.12325538  0.15469905
## TechSup      -0.12534905 -0.093310796 -0.132972485  0.78772961 -0.16591472
## CompRes       0.02030565 -0.022150933 -0.004487151 -0.10968621  0.28834438
## Advertising  -0.13192319  0.262879343 -0.063298038  0.02759096 -0.03265804
## ProdLine      1.00000000 -0.230170570 -0.361757904  0.25718360 -0.28098068
## SalesFImage  -0.23017057  1.000000000  0.126612489  0.18930027 -0.18235930
## ComPricing   -0.36175790  0.126612489  1.000000000  0.02035155 -0.08650087
## WartyClaim    0.25718360  0.189300271  0.020351554  1.00000000  0.25984451
## OrdBilling   -0.28098068 -0.182359301 -0.086500869  0.25984451  1.00000000
## DelSpeed      0.50189505  0.005889925  0.190437993 -0.09164900  0.35053021
## Satisfaction  0.18325156  0.660251850 -0.087467711 -0.08868071  0.14880055
##                  DelSpeed Satisfaction
## ProdQual     -0.335084210   0.60743879
## Ecom         -0.008391793  -0.33084408
## TechSup      -0.021914916   0.05511187
## CompRes       0.528932328   0.17241635
## Advertising   0.204654406  -0.04497354
## ProdLine      0.501895051   0.18325156
## SalesFImage   0.005889925   0.66025185
## ComPricing    0.190437993  -0.08746771
## WartyClaim   -0.091649002  -0.08868071
## OrdBilling    0.350530212   0.14880055
## DelSpeed      1.000000000   0.08955614
## Satisfaction  0.089556143   1.00000000
## 
## $p.value
##                  ProdQual         Ecom      TechSup      CompRes Advertising
## ProdQual     0.000000e+00 1.447435e-01 9.819081e-01 5.977987e-01  0.29163501
## Ecom         1.447435e-01 0.000000e+00 4.402823e-01 7.538375e-01  0.99778145
## TechSup      9.819081e-01 4.402823e-01 0.000000e+00 1.769171e-01  0.57876015
## CompRes      5.977987e-01 7.538375e-01 1.769171e-01 0.000000e+00  0.54395113
## Advertising  2.916350e-01 9.977814e-01 5.787601e-01 5.439511e-01  0.00000000
## ProdLine     7.269030e-03 1.476353e-01 2.391187e-01 8.493380e-01  0.21517368
## SalesFImage  2.576212e-04 2.442012e-16 3.817042e-01 8.358301e-01  0.01230672
## ComPricing   8.956443e-01 8.890480e-01 2.115149e-01 9.665194e-01  0.55338281
## WartyClaim   6.890839e-01 2.471182e-01 3.262868e-20 3.034147e-01  0.79629803
## OrdBilling   6.097697e-01 1.454288e-01 1.180864e-01 5.850710e-03  0.75993047
## DelSpeed     1.245192e-03 9.374303e-01 8.375553e-01 8.359069e-08  0.05300070
## Satisfaction 2.182238e-10 1.447603e-03 6.059096e-01 1.041593e-01  0.67382405
##                  ProdLine  SalesFImage   ComPricing   WartyClaim   OrdBilling
## ProdQual     7.269030e-03 2.576212e-04 0.8956443272 6.890839e-01 0.6097697424
## Ecom         1.476353e-01 2.442012e-16 0.8890479591 2.471182e-01 0.1454287628
## TechSup      2.391187e-01 3.817042e-01 0.2115148546 3.262868e-20 0.1180864174
## CompRes      8.493380e-01 8.358301e-01 0.9665194173 3.034147e-01 0.0058507099
## Advertising  2.151737e-01 1.230672e-02 0.5533828069 7.962980e-01 0.7599304715
## ProdLine     0.000000e+00 2.907563e-02 0.0004593443 1.440249e-02 0.0073046065
## SalesFImage  2.907563e-02 0.000000e+00 0.2343791374 7.394541e-02 0.0853804743
## ComPricing   4.593443e-04 2.343791e-01 0.0000000000 8.490015e-01 0.4175555780
## WartyClaim   1.440249e-02 7.394541e-02 0.8490014635 0.000000e+00 0.0133877361
## OrdBilling   7.304606e-03 8.538047e-02 0.4175555780 1.338774e-02 0.0000000000
## DelSpeed     4.664278e-07 9.560619e-01 0.0721942454 3.902770e-01 0.0007064114
## Satisfaction 8.383625e-02 1.449719e-12 0.4123500152 4.058729e-01 0.1615974575
##                  DelSpeed Satisfaction
## ProdQual     1.245192e-03 2.182238e-10
## Ecom         9.374303e-01 1.447603e-03
## TechSup      8.375553e-01 6.059096e-01
## CompRes      8.359069e-08 1.041593e-01
## Advertising  5.300070e-02 6.738241e-01
## ProdLine     4.664278e-07 8.383625e-02
## SalesFImage  9.560619e-01 1.449719e-12
## ComPricing   7.219425e-02 4.123500e-01
## WartyClaim   3.902770e-01 4.058729e-01
## OrdBilling   7.064114e-04 1.615975e-01
## DelSpeed     0.000000e+00 4.012356e-01
## Satisfaction 4.012356e-01 0.000000e+00
## 
## $statistic
##                 ProdQual         Ecom     TechSup     CompRes  Advertising
## ProdQual      0.00000000  1.471424516  0.02274128 -0.52949015  1.060907458
## Ecom          1.47142452  0.000000000  0.77522957 -0.31456380 -0.002788456
## TechSup       0.02274128  0.775229571  0.00000000  1.36122814 -0.557266374
## CompRes      -0.52949015 -0.314563798  1.36122814  0.00000000 -0.609215688
## Advertising   1.06090746 -0.002788456 -0.55726637 -0.60921569  0.000000000
## ProdLine      2.74821968  1.460787002 -1.18522655  0.19052316 -1.248460807
## SalesFImage  -3.80921125 10.081854496 -0.87916864 -0.20784517  2.555921881
## ComPricing   -0.13154519  0.139913642 -1.25856891 -0.04209363 -0.594981366
## WartyClaim   -0.40142023 -1.165122048 11.99562484 -1.03519395  0.258924708
## OrdBilling    0.51223505  1.468888754 -1.57829305  2.82489237 -0.306523103
## DelSpeed     -3.33624278 -0.078724768 -0.20562952  5.84663073  1.961341689
## Satisfaction  7.17336519 -3.288799958  0.51778207  1.64199901 -0.422316520
##                ProdLine SalesFImage  ComPricing WartyClaim OrdBilling
## ProdQual      2.7482197 -3.80921125 -0.13154519 -0.4014202  0.5122350
## Ecom          1.4607870 10.08185450  0.13991364 -1.1651220  1.4688888
## TechSup      -1.1852266 -0.87916864 -1.25856891 11.9956248 -1.5782930
## CompRes       0.1905232 -0.20784517 -0.04209363 -1.0351939  2.8248924
## Advertising  -1.2484608  2.55592188 -0.59498137  0.2589247 -0.3065231
## ProdLine      0.0000000 -2.21876450 -3.64012829  2.4965744 -2.7464786
## SalesFImage  -2.2187645  0.00000000  1.19736653  1.8084929 -1.7398559
## ComPricing   -3.6401283  1.19736653  0.00000000  0.1909540 -0.8145030
## WartyClaim    2.4965744  1.80849286  0.19095404  0.0000000  2.5242649
## OrdBilling   -2.7464786 -1.73985585 -0.81450302  2.5242649  0.0000000
## DelSpeed      5.4434474  0.05525335  1.81976992 -0.8633775  3.5110350
## Satisfaction  1.7486638  8.24679932 -0.82367672 -0.8351894  1.4115877
##                 DelSpeed Satisfaction
## ProdQual     -3.33624278    7.1733652
## Ecom         -0.07872477   -3.2888000
## TechSup      -0.20562952    0.5177821
## CompRes       5.84663073    1.6419990
## Advertising   1.96134169   -0.4223165
## ProdLine      5.44344736    1.7486638
## SalesFImage   0.05525335    8.2467993
## ComPricing    1.81976992   -0.8236767
## WartyClaim   -0.86337748   -0.8351894
## OrdBilling    3.51103503    1.4115877
## DelSpeed      0.00000000    0.8435005
## Satisfaction  0.84350046    0.0000000
## 
## $n
## [1] 100
## 
## $gp
## [1] 10
## 
## $method
## [1] "pearson"

We can see that there is a high degree of collinearity between the independent variables.

Multiple Linear Regression

model1 <- lm(Satisfaction ~ ., prodsurvey[,-1])
summary(model1)

## 
## Call:
## lm(formula = Satisfaction ~ ., data = prodsurvey[, -1])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43005 -0.31165  0.07621  0.37190  0.90120 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.66961    0.81233  -0.824  0.41199    
## ProdQual     0.37137    0.05177   7.173 2.18e-10 ***
## Ecom        -0.44056    0.13396  -3.289  0.00145 ** 
## TechSup      0.03299    0.06372   0.518  0.60591    
## CompRes      0.16703    0.10173   1.642  0.10416    
## Advertising -0.02602    0.06161  -0.422  0.67382    
## ProdLine     0.14034    0.08025   1.749  0.08384 .  
## SalesFImage  0.80611    0.09775   8.247 1.45e-12 ***
## ComPricing  -0.03853    0.04677  -0.824  0.41235    
## WartyClaim  -0.10298    0.12330  -0.835  0.40587    
## OrdBilling   0.14635    0.10367   1.412  0.16160    
## DelSpeed     0.16570    0.19644   0.844  0.40124    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5623 on 88 degrees of freedom
## Multiple R-squared:  0.8021, Adjusted R-squared:  0.7774 
## F-statistic: 32.43 on 11 and 88 DF,  p-value: < 2.2e-16

As in our model the adjusted R-squared: 0.7774, meaning that independent variables explain 78% of the variance of the dependent variable, only 3 variables are significant out of 11 independent variables.
The p-value of the F-statistic is less than 0.05 (level of Significance), which means our model is significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable.
Our model equation can be written as:
Satisfaction = -0.66 + 0.37 x ProdQual -0.44 x Ecom + 0.034 x TechSup + 0.16 x CompRes -0.02 x Advertising + 0.14 x ProdLine + 0.80 x SalesFImage - 0.038 x CompPricing - 0.10 x WartyClaim + 0.14 x OrdBilling + 0.16 x DelSpeed

Variable Inflation Factor (VIF)

vif(model1)

##    ProdQual        Ecom     TechSup     CompRes Advertising    ProdLine 
##    1.635797    2.756694    2.976796    4.730448    1.508933    3.488185 
## SalesFImage  ComPricing  WartyClaim  OrdBilling    DelSpeed 
##    3.439420    1.635000    3.198337    2.902999    6.516014

High Variable Inflation Factor (VIF) is a sign of multicollinearity. There is no formal VIF value for determining the presence of multicollinearity; however, in weaker models, VIF value greater than 2.5 may be a cause of concern.
From the VIF values, we can infer that variables CompRes and DelSpeed are a cause of concern.
Remedial Measures: Two of the most commonly used methods to deal with multicollinearity in the model is the following.
- Remove some of the highly correlated variables using VIF or stepwise algorithms.
- Perform an analysis design like principal component analysis (PCA)/ Factor Analysis on the correlated variables.

Factor Analysis

X <- prodsurvey[,-c(1,13)]
y <- prodsurvey[,13]

Let’s check the factorability of the variables in the dataset: perform the Kaiser-Meyer-Olkin (KMO) Test. > Kaiser-Meyer-Olkin (KMO) Test and/or Bartlett’s Test is a measure of how suited your data is for Factor Analysis. The test measures sampling adequacy for each variable in the model and for the complete model. The statistic is a measure of the proportion of variance among variables that might be common variance. The lower the proportion, the more suited your data is to Factor Analysis.

datamatrix2 <- cor(X)
KMO(r=datamatrix2)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = datamatrix2)
## Overall MSA =  0.65
## MSA for each item = 
##    ProdQual        Ecom     TechSup     CompRes Advertising    ProdLine 
##        0.51        0.63        0.52        0.79        0.78        0.62 
## SalesFImage  ComPricing  WartyClaim  OrdBilling    DelSpeed 
##        0.62        0.75        0.51        0.76        0.67

MSA (measure of sampling adequacy) > 0.5, we can run Factor Analysis on this data.

print(cortest.bartlett(datamatrix2, nrow(prodsurvey[,-1])))

## $chisq
## [1] 619.2726
## 
## $p.value
## [1] 1.79337e-96
## 
## $df
## [1] 55

The approximate of Chi-square is 619.27 with 55 degrees of freedom, which is significant at 0.05 Level of significance.

Hence Factor Analysis is considered as an appropriate technique for further analysis of the data.

Find the optimal clusters

Plotting the eigenvalues from our factor analysis (whether it’s based on principal axis or principal components extraction), a parallel analysis involves generating random correlation matrices and after factor analyzing them, comparing the resulting eigenvalues to the eigenvalues of the observed data. The idea behind this method is thatobserved eigenvalues that are higher than their corresponding random eigenvalues are more likely to be from “meaningful factors” than observed eigenvalues that are below their corresponding random eigenvalue.

parallel <- fa.parallel(X)

## Parallel analysis suggests that the number of factors =  3  and the number of components =  3

# other option of scree plot
#scree(X)

Selection of factors from the scree plot can be based on: 1. Kaiser-Guttman normalization rule says that we should choose all factors with an eigenvalue greater than 1. 2. Bend elbow rule

When looking at the parallel analysis scree plots, there are two places to look depending on which type of factor analysis you’re looking to run. The two blue lines show you the observed eigenvalues - they should look identical to the scree plots drawn by the scree function. The red dotted lines show the random eigenvalues or the simulated data line. Each point on the blue line that lies above the corresponding simulated data line is a factor or component to extract. In this analysis, you can see that 4 factors in the “Factor Analysis” parallel analysis lie above the corresponding simulated data line and 3 components in the “Principal Components” parallel analysis lie above the corresponding simulated data line. In our case, however, the last factor/component lies very close to the line - for both principal components extraction and principal axis extraction. Thus, we should probably compare the 3 factor and 4 factor solutions, to see which one is most interpretable.

The scree plot graphs the Eigenvalue against each factor. We can see from the graph that after factor 4 there is a sharp change in the curvature of the scree plot. This shows that after factor 4 the total variance accounts for smaller amounts.

ev <- eigen(cor(X))
ev$values

##  [1] 3.42697133 2.55089671 1.69097648 1.08655606 0.60942409 0.55188378
##  [7] 0.40151815 0.24695154 0.20355327 0.13284158 0.09842702

Factor <- rep(1:11)
Eigen_Values <- ev$values
Scree <- data.frame(Factor, Eigen_Values)

ggplot(data = Scree, mapping = aes(x = Factor, y = Eigen_Values)) + 
  geom_point() +
  geom_line() +
  scale_y_continuous(name = "Eigen Values", limits = c(0,4)) +
  theme_bw() + 
  theme(panel.grid.major.y = element_line(color = "skyblue")) +
  ggtitle("Scree Plot")

So as per the elbow or Kaiser-Guttman normalization rule, we are good to go ahead with 4 factors.

Conducting the Factor Analysis

Factor analysis using fa method:

fa.none <-  fa(r=X, 
              nfactors = 4, 
#              covar = FALSE, SMC = TRUE,
              fm="pa", # type of factor analysis we want to use (“pa” is principal axis factoring)
              max.iter=100, # (50 is the default, but we have changed it to 100
              rotate="none") # none rotation
print(fa.none)

## Factor Analysis using method =  pa
## Call: fa(r = X, nfactors = 4, rotate = "none", max.iter = 100, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##               PA1   PA2   PA3   PA4   h2    u2 com
## ProdQual     0.20 -0.41 -0.06  0.46 0.42 0.576 2.4
## Ecom         0.29  0.66  0.27  0.22 0.64 0.362 2.0
## TechSup      0.28 -0.38  0.74 -0.17 0.79 0.205 1.9
## CompRes      0.86  0.01 -0.26 -0.18 0.84 0.157 1.3
## Advertising  0.29  0.46  0.08  0.13 0.31 0.686 1.9
## ProdLine     0.69 -0.45 -0.14  0.31 0.80 0.200 2.3
## SalesFImage  0.39  0.80  0.35  0.25 0.98 0.021 2.1
## ComPricing  -0.23  0.55 -0.04 -0.29 0.44 0.557 1.9
## WartyClaim   0.38 -0.32  0.74 -0.15 0.81 0.186 2.0
## OrdBilling   0.75  0.02 -0.18 -0.18 0.62 0.378 1.2
## DelSpeed     0.90  0.10 -0.30 -0.20 0.94 0.058 1.4
## 
##                        PA1  PA2  PA3  PA4
## SS loadings           3.21 2.22 1.50 0.68
## Proportion Var        0.29 0.20 0.14 0.06
## Cumulative Var        0.29 0.49 0.63 0.69
## Proportion Explained  0.42 0.29 0.20 0.09
## Cumulative Proportion 0.42 0.71 0.91 1.00
## 
## Mean item complexity =  1.9
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  55  and the objective function was  6.55 with Chi Square of  619.27
## The degrees of freedom for the model are 17  and the objective function was  0.33 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  100 with the empirical chi square  3.19  with prob <  1 
## The total number of observations was  100  with Likelihood Chi Square =  30.27  with prob <  0.024 
## 
## Tucker Lewis Index of factoring reliability =  0.921
## RMSEA index =  0.096  and the 90 % confidence intervals are  0.032 0.139
## BIC =  -48.01
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1  PA2  PA3  PA4
## Correlation of (regression) scores with factors   0.98 0.97 0.95 0.88
## Multiple R square of scores with factors          0.96 0.95 0.91 0.78
## Minimum correlation of possible factor scores     0.92 0.90 0.82 0.56

Factor analysis using the factanal method:

factanal.none <- factanal(X, factors=4, scores = c("regression"), rotation = "none")
print(factanal.none)

## 
## Call:
## factanal(x = X, factors = 4, scores = c("regression"), rotation = "none")
## 
## Uniquenesses:
##    ProdQual        Ecom     TechSup     CompRes Advertising    ProdLine 
##       0.682       0.360       0.228       0.178       0.679       0.005 
## SalesFImage  ComPricing  WartyClaim  OrdBilling    DelSpeed 
##       0.017       0.636       0.163       0.347       0.076 
## 
## Loadings:
##             Factor1 Factor2 Factor3 Factor4
## ProdQual     0.467  -0.148  -0.229  -0.159 
## Ecom                 0.791                 
## TechSup      0.198          -0.482   0.707 
## CompRes      0.589   0.316   0.535   0.300 
## Advertising          0.556   0.110         
## ProdLine     0.997                         
## SalesFImage          0.987                 
## ComPricing  -0.491   0.252   0.238         
## WartyClaim   0.279   0.124  -0.487   0.712 
## OrdBilling   0.452   0.275   0.493   0.360 
## DelSpeed     0.629   0.364   0.582   0.241 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      2.522   2.316   1.469   1.322
## Proportion Var   0.229   0.211   0.134   0.120
## Cumulative Var   0.229   0.440   0.573   0.694
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 24.26 on 17 degrees of freedom.
## The p-value is 0.113

factanal.var <- factanal(X, factors=4, scores = c("regression"), rotation = "varimax")
print(factanal.var)

## 
## Call:
## factanal(x = X, factors = 4, scores = c("regression"), rotation = "varimax")
## 
## Uniquenesses:
##    ProdQual        Ecom     TechSup     CompRes Advertising    ProdLine 
##       0.682       0.360       0.228       0.178       0.679       0.005 
## SalesFImage  ComPricing  WartyClaim  OrdBilling    DelSpeed 
##       0.017       0.636       0.163       0.347       0.076 
## 
## Loadings:
##             Factor1 Factor2 Factor3 Factor4
## ProdQual                             0.557 
## Ecom                 0.793                 
## TechSup                      0.872   0.102 
## CompRes      0.884   0.142           0.135 
## Advertising  0.190   0.521          -0.110 
## ProdLine     0.502           0.104   0.856 
## SalesFImage  0.119   0.974          -0.130 
## ComPricing           0.225  -0.216  -0.514 
## WartyClaim                   0.894   0.158 
## OrdBilling   0.794   0.101   0.105         
## DelSpeed     0.928   0.189           0.164 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      2.592   1.977   1.638   1.423
## Proportion Var   0.236   0.180   0.149   0.129
## Cumulative Var   0.236   0.415   0.564   0.694
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 24.26 on 17 degrees of freedom.
## The p-value is 0.113

Graph Factor Loading Matrices

Factor analysis results are typically interpreted in terms of the major loadings on each factor. These structures may be represented an “interpretable” solution as a table of loadings or graphically, where all loadings with an absolute value \(\rightarrow\) some cut point are represented as an edge (path).

fa.diagram(fa.none)

The first 4 factors have an Eigenvalue >1 and which explains almost 69% of the variance. We can effectively reduce dimensionality from 11 to 4 while only losing about 31% of the variance.
Factor 1 accounts for 29.20% of the variance; Factor 2 accounts for 20.20% of the variance; Factor 3 accounts for 13.60% of the variance; Factor 4 accounts for 6% of the variance. All the 4 factors together explain for 69% of the variance in performance.

fa.var <-  fa(r=X, 
              nfactors = 4, 
#              covar = FALSE, SMC = TRUE,
              fm="pa", # type of factor analysis we want to use (“pa” is principal axis factoring)
              max.iter=100, # (50 is the default, but we have changed it to 100
              rotate="varimax") # none rotation
print(fa.var)

## Factor Analysis using method =  pa
## Call: fa(r = X, nfactors = 4, rotate = "varimax", max.iter = 100, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##               PA1   PA2   PA3   PA4   h2    u2 com
## ProdQual     0.02 -0.07  0.02  0.65 0.42 0.576 1.0
## Ecom         0.07  0.79  0.03 -0.11 0.64 0.362 1.1
## TechSup      0.02 -0.03  0.88  0.12 0.79 0.205 1.0
## CompRes      0.90  0.13  0.05  0.13 0.84 0.157 1.1
## Advertising  0.17  0.53 -0.04 -0.06 0.31 0.686 1.2
## ProdLine     0.53 -0.04  0.13  0.71 0.80 0.200 1.9
## SalesFImage  0.12  0.97  0.06 -0.13 0.98 0.021 1.1
## ComPricing  -0.08  0.21 -0.21 -0.59 0.44 0.557 1.6
## WartyClaim   0.10  0.06  0.89  0.13 0.81 0.186 1.1
## OrdBilling   0.77  0.13  0.09  0.09 0.62 0.378 1.1
## DelSpeed     0.95  0.19  0.00  0.09 0.94 0.058 1.1
## 
##                        PA1  PA2  PA3  PA4
## SS loadings           2.63 1.97 1.64 1.37
## Proportion Var        0.24 0.18 0.15 0.12
## Cumulative Var        0.24 0.42 0.57 0.69
## Proportion Explained  0.35 0.26 0.22 0.18
## Cumulative Proportion 0.35 0.60 0.82 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  55  and the objective function was  6.55 with Chi Square of  619.27
## The degrees of freedom for the model are 17  and the objective function was  0.33 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  100 with the empirical chi square  3.19  with prob <  1 
## The total number of observations was  100  with Likelihood Chi Square =  30.27  with prob <  0.024 
## 
## Tucker Lewis Index of factoring reliability =  0.921
## RMSEA index =  0.096  and the 90 % confidence intervals are  0.032 0.139
## BIC =  -48.01
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1  PA2  PA3  PA4
## Correlation of (regression) scores with factors   0.98 0.99 0.94 0.88
## Multiple R square of scores with factors          0.96 0.97 0.88 0.78
## Minimum correlation of possible factor scores     0.93 0.94 0.77 0.55

fa.var$loadings

## 
## Loadings:
##             PA1    PA2    PA3    PA4   
## ProdQual                          0.647
## Ecom                0.787        -0.113
## TechSup                    0.883  0.116
## CompRes      0.898  0.130         0.132
## Advertising  0.166  0.530              
## ProdLine     0.525         0.127  0.712
## SalesFImage  0.115  0.971        -0.135
## ComPricing          0.213 -0.209 -0.590
## WartyClaim   0.103         0.885  0.128
## OrdBilling   0.768  0.127              
## DelSpeed     0.949  0.185              
## 
##                  PA1   PA2   PA3   PA4
## SS loadings    2.635 1.967 1.641 1.371
## Proportion Var 0.240 0.179 0.149 0.125
## Cumulative Var 0.240 0.418 0.568 0.692

fa.diagram(fa.var)

The red dotted line means that Competitive Pricing marginally falls under the PA4 bucket and the loading are negative.

Labeling and interpretation of the factors

Factors	Variables	Label	Short Interpretaation
PA1	DelSpeed, CompRes, OrdBilling	Purchase	All the items are related to purchasing the product; from placing order to billing and getting it delivered.
PA2	SalesFImage, Ecom, Advertising	Marketing	In this factors the item are related to marketing processes like the image of sales force and spending on adverstising
PA3	WartyClaim, TechSup	Post Purchase	Post purchase activities are included in this factor; like warranty claims and technical support.
PA4	ProdLine, ProdQual, CompPricing	Product Position	Product positioning related items are grouped in this factor.

Scores for all the rows

head(fa.var$scores)

##             PA1        PA2          PA3         PA4
## [1,] -0.1338871  0.9175166 -1.719604873  0.09135411
## [2,]  1.6297604 -2.0090053 -0.596361722  0.65808192
## [3,]  0.3637658  0.8361736  0.002979966  1.37548765
## [4,] -1.2225230 -0.5491336  1.245473305 -0.64421384
## [5,] -0.4854209 -0.4276223 -0.026980304  0.47360747
## [6,] -0.5950924 -1.3035333 -1.183019401 -0.95913571

#fa.var$scores # for 100 observations

Regression analysis using the factors scores as the independent variable

regdata <- cbind(prodsurvey["Satisfaction"], fa.var$scores)

#Labeling the data
names(regdata) <- c("Satisfaction", "Purchase", "Marketing",
                    "Post_purchase", "Prod_positioning")
head(regdata)

##   Satisfaction   Purchase  Marketing Post_purchase Prod_positioning
## 1          8.2 -0.1338871  0.9175166  -1.719604873       0.09135411
## 2          5.7  1.6297604 -2.0090053  -0.596361722       0.65808192
## 3          8.9  0.3637658  0.8361736   0.002979966       1.37548765
## 4          4.8 -1.2225230 -0.5491336   1.245473305      -0.64421384
## 5          7.1 -0.4854209 -0.4276223  -0.026980304       0.47360747
## 6          4.7 -0.5950924 -1.3035333  -1.183019401      -0.95913571

Splitting the data to train and test set

#Splitting the data 70:30
#Random number generator, set seed.
set.seed(100)
indices= sample(1:nrow(regdata), 0.7*nrow(regdata))
train=regdata[indices,]
test = regdata[-indices,]

Regression Model using train data

model.fa.score = lm(Satisfaction~., train)
summary(model.fa.score)

## 
## Call:
## lm(formula = Satisfaction ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76280 -0.48717  0.06799  0.46459  1.24022 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.91852    0.08068  85.757  < 2e-16 ***
## Purchase          0.50230    0.07641   6.574 9.74e-09 ***
## Marketing         0.75488    0.08390   8.998 5.00e-13 ***
## Post_purchase     0.08755    0.08216   1.066    0.291    
## Prod_positioning  0.58074    0.08781   6.614 8.30e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6629 on 65 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7093 
## F-statistic: 43.08 on 4 and 65 DF,  p-value: < 2.2e-16

Check vif

vif(model.fa.score)

##         Purchase        Marketing    Post_purchase Prod_positioning 
##         1.009648         1.008235         1.015126         1.024050

As per the VIF values, we don’t have multicollinearity in the model

Check prediction of the model in the test dataset

#Model Performance metrics:
pred_test <- predict(model.fa.score, newdata = test, type = "response")
pred_test

##        6        8       11       13       17       19       26       33 
## 4.975008 5.908267 6.951629 8.677431 6.613838 6.963113 6.313513 6.141338 
##       34       35       37       40       42       44       49       50 
## 6.158993 7.415742 6.589746 6.858206 7.133989 8.533080 8.765145 8.078744 
##       53       56       57       60       65       67       71       73 
## 7.395438 7.468360 8.744402 6.276660 5.936570 6.650322 8.299545 7.685564 
##       75       80       96       97       99      100 
## 7.330191 6.719528 7.540233 6.143172 8.084583 5.799897

#Find MSE and MAPE scores:
#MSE/ MAPE of Model1
test$Satisfaction_Predicted <- pred_test
head(test[c("Satisfaction","Satisfaction_Predicted")], 10)

##    Satisfaction Satisfaction_Predicted
## 6           4.7               4.975008
## 8           6.3               5.908267
## 11          7.4               6.951629
## 13          8.4               8.677431
## 17          6.4               6.613838
## 19          6.8               6.963113
## 26          6.6               6.313513
## 33          5.4               6.141338
## 34          7.3               6.158993
## 35          6.3               7.415742

test_r2 <- cor(test$Satisfaction, test$Satisfaction_Predicted) ^2

mse_test <- mse(test$Satisfaction, pred_test)
rmse_test <- sqrt(mse(test$Satisfaction, pred_test))
mape_test <- mape(test$Satisfaction, pred_test)
model_metrics <- cbind(mse_test,rmse_test,mape_test,test_r2)
print(model_metrics, 3)

##      mse_test rmse_test mape_test test_r2
## [1,]    0.551     0.742    0.0875   0.554

As the feature Post_purchase is not significant so we will drop this feature and then let’s run the regression model again.

##Regression model without post_purchase:
model2 <- lm(Satisfaction ~ Purchase+ Marketing+ 
                Prod_positioning, data= train)
summary(model2)

## 
## Call:
## lm(formula = Satisfaction ~ Purchase + Marketing + Prod_positioning, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7268 -0.5036  0.1117  0.4408  1.2962 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.92158    0.08071  85.761  < 2e-16 ***
## Purchase          0.49779    0.07637   6.518 1.15e-08 ***
## Marketing         0.75870    0.08391   9.042 3.66e-13 ***
## Prod_positioning  0.59084    0.08739   6.761 4.30e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6636 on 66 degrees of freedom
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7087 
## F-statistic: 56.95 on 3 and 66 DF,  p-value: < 2.2e-16

pred_test2 <- predict(model2, newdata = test, type = "response")
pred_test2

##        6        8       11       13       17       19       26       33 
## 5.069667 5.958765 7.001694 8.695630 6.521440 6.994018 6.249667 6.123357 
##       34       35       37       40       42       44       49       50 
## 6.132165 7.415882 6.566556 6.964570 7.233376 8.424720 8.853060 8.177009 
##       53       56       57       60       65       67       71       73 
## 7.413693 7.404078 8.746511 6.215816 5.893515 6.476465 8.227837 7.839941 
##       75       80       96       97       99      100 
## 7.354907 6.734279 7.559995 6.293774 8.083909 5.832094

test$Satisfaction_Predicted2 <- pred_test2
head(test[c(1,7)], 10)

##    Satisfaction Satisfaction_Predicted2
## 6           4.7                5.069667
## 8           6.3                5.958765
## 11          7.4                7.001694
## 13          8.4                8.695630
## 17          6.4                6.521440
## 19          6.8                6.994018
## 26          6.6                6.249667
## 33          5.4                6.123357
## 34          7.3                6.132165
## 35          6.3                7.415882

test_r22 <- cor(test$Satisfaction, test$Satisfaction_Predicted2) ^2
mse_test2 <- mse(test$Satisfaction, pred_test2)
rmse_test2 <- sqrt(mse(test$Satisfaction, pred_test2))
mape_test2 <- mape(test$Satisfaction, pred_test2)

model2_metrics <- cbind(mse_test2,rmse_test2,mape_test2,test_r22)
model2_metrics

##      mse_test2 rmse_test2 mape_test2  test_r22
## [1,] 0.5446039  0.7379728 0.08490197 0.5595099

Overall <- rbind(model_metrics,model2_metrics)
row.names(Overall) <- c("Test1", "Test2")
colnames(Overall) <- c("MSE", "RMSE", "MAPE", "R-squared")
print(Overall,3)

##         MSE  RMSE   MAPE R-squared
## Test1 0.551 0.742 0.0875     0.554
## Test2 0.545 0.738 0.0849     0.560

Try the interaction

Including Interaction model, we are able to make a better prediction!??

Conclusion

We saw how Factor Analysis can be used to reduce the dimensionality of a dataset and then we used multiple linear regression on the dimensionally reduced columns/Features for further analysis/predictions.

Reference (adapted from)

Jay Narayan, Multiple Linear Regression & Factor Analysis in R, May 28, 2019
Rachael Smyth and Andrew Johnson, Factor Analysis
Sreenu Konda, Lecture on Factor Analysis, Bistatistics, SPH, UIC, Spring 2020