1- Is it possible that we can run the regression analysis on factors generating from Factor Analysis?
2- If so, how can we run it?
3- What is difference between FA and PCA?
The demonstration is to use the dataset Factor-Hair-Revised.csv to build a regression model to predict satisfaction.
prodsurvey <- read_csv("Factor-Hair-Revised.csv")
## Parsed with column specification:
## cols(
## ID = col_double(),
## ProdQual = col_double(),
## Ecom = col_double(),
## TechSup = col_double(),
## CompRes = col_double(),
## Advertising = col_double(),
## ProdLine = col_double(),
## SalesFImage = col_double(),
## ComPricing = col_double(),
## WartyClaim = col_double(),
## OrdBilling = col_double(),
## DelSpeed = col_double(),
## Satisfaction = col_double()
## )
head(prodsurvey)
## # A tibble: 6 x 13
## ID ProdQual Ecom TechSup CompRes Advertising ProdLine SalesFImage
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 8.5 3.9 2.5 5.9 4.8 4.9 6
## 2 2 8.2 2.7 5.1 7.2 3.4 7.9 3.1
## 3 3 9.2 3.4 5.6 5.6 5.4 7.4 5.8
## 4 4 6.4 3.3 7 3.7 4.7 4.7 4.5
## 5 5 9 3.4 5.2 4.6 2.2 6 4.5
## 6 6 6.5 2.8 3.1 4.1 4 4.3 3.7
## # ... with 5 more variables: ComPricing <dbl>, WartyClaim <dbl>,
## # OrdBilling <dbl>, DelSpeed <dbl>, Satisfaction <dbl>
dim(prodsurvey)
## [1] 100 13
str(prodsurvey)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 13 variables:
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ ProdQual : num 8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
## $ Ecom : num 3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
## $ TechSup : num 2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
## $ CompRes : num 5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
## $ Advertising : num 4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
## $ ProdLine : num 4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
## $ SalesFImage : num 6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
## $ ComPricing : num 6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
## $ WartyClaim : num 4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
## $ OrdBilling : num 5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
## $ DelSpeed : num 3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
## $ Satisfaction: num 8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_double(),
## .. ProdQual = col_double(),
## .. Ecom = col_double(),
## .. TechSup = col_double(),
## .. CompRes = col_double(),
## .. Advertising = col_double(),
## .. ProdLine = col_double(),
## .. SalesFImage = col_double(),
## .. ComPricing = col_double(),
## .. WartyClaim = col_double(),
## .. OrdBilling = col_double(),
## .. DelSpeed = col_double(),
## .. Satisfaction = col_double()
## .. )
names(prodsurvey)
## [1] "ID" "ProdQual" "Ecom" "TechSup" "CompRes"
## [6] "Advertising" "ProdLine" "SalesFImage" "ComPricing" "WartyClaim"
## [11] "OrdBilling" "DelSpeed" "Satisfaction"
describe(prodsurvey)
## vars n mean sd median trimmed mad min max range skew
## ID 1 100 50.50 29.01 50.50 50.50 37.06 1.0 100.0 99.0 0.00
## ProdQual 2 100 7.81 1.40 8.00 7.85 1.78 5.0 10.0 5.0 -0.24
## Ecom 3 100 3.67 0.70 3.60 3.63 0.52 2.2 5.7 3.5 0.64
## TechSup 4 100 5.37 1.53 5.40 5.40 1.85 1.3 8.5 7.2 -0.20
## CompRes 5 100 5.44 1.21 5.45 5.46 1.26 2.6 7.8 5.2 -0.13
## Advertising 6 100 4.01 1.13 4.00 4.00 1.19 1.9 6.5 4.6 0.04
## ProdLine 7 100 5.80 1.32 5.75 5.81 1.56 2.3 8.4 6.1 -0.09
## SalesFImage 8 100 5.12 1.07 4.90 5.09 0.89 2.9 8.2 5.3 0.37
## ComPricing 9 100 6.97 1.55 7.10 7.01 1.93 3.7 9.9 6.2 -0.23
## WartyClaim 10 100 6.04 0.82 6.10 6.04 0.89 4.1 8.1 4.0 0.01
## OrdBilling 11 100 4.28 0.93 4.40 4.31 0.74 2.0 6.7 4.7 -0.32
## DelSpeed 12 100 3.89 0.73 3.90 3.92 0.74 1.6 5.5 3.9 -0.45
## Satisfaction 13 100 6.92 1.19 7.05 6.90 1.33 4.7 9.9 5.2 0.08
## kurtosis se
## ID -1.24 2.90
## ProdQual -1.17 0.14
## Ecom 0.57 0.07
## TechSup -0.63 0.15
## CompRes -0.66 0.12
## Advertising -0.94 0.11
## ProdLine -0.60 0.13
## SalesFImage 0.26 0.11
## ComPricing -0.96 0.15
## WartyClaim -0.53 0.08
## OrdBilling 0.11 0.09
## DelSpeed 0.09 0.07
## Satisfaction -0.86 0.12
datamatrix1 <- cor(prodsurvey[,-1])
corrplot(datamatrix1, method="number")
As we can see from the above correlation matrix: 1. Complaint resolution (CompRes
) and delivery speed (DelSpeed
) are highly correlated 2. Order billing (OrdBilling
) and CompRes
are highly correlated 3. WartyClaim
and TechSupport
are highly correlated 4. OrdBilling
and DelSpeed
are highly correlated 5. e-commerce (Ecom
) and sales force image (SalesFImage
) are highly correlated
pcor(prodsurvey[,-1], method="pearson")
## $estimate
## ProdQual Ecom TechSup CompRes Advertising
## ProdQual 1.000000000 0.1549597037 0.002424222 -0.056354142 0.1123767461
## Ecom 0.154959704 1.0000000000 0.082359000 -0.033513777 -0.0002972504
## TechSup 0.002424222 0.0823590001 1.000000000 0.143603415 -0.0593002540
## CompRes -0.056354142 -0.0335137768 0.143603415 1.000000000 -0.0648060931
## Advertising 0.112376746 -0.0002972504 -0.059300254 -0.064806093 1.0000000000
## ProdLine 0.281144724 0.1538660545 -0.125349050 0.020305650 -0.1319231866
## SalesFImage -0.376228551 0.7321011890 -0.093310796 -0.022150933 0.2628793429
## ComPricing -0.014021386 0.0149131857 -0.132972485 -0.004487151 -0.0632980381
## WartyClaim -0.042752416 -0.1232553822 0.787729606 -0.109686211 0.0275909587
## OrdBilling 0.054523215 0.1546990521 -0.165914724 0.288344382 -0.0326580430
## DelSpeed -0.335084210 -0.0083917930 -0.021914916 0.528932328 0.2046544064
## Satisfaction 0.607438787 -0.3308440834 0.055111867 0.172416347 -0.0449735411
## ProdLine SalesFImage ComPricing WartyClaim OrdBilling
## ProdQual 0.28114472 -0.376228551 -0.014021386 -0.04275242 0.05452322
## Ecom 0.15386605 0.732101189 0.014913186 -0.12325538 0.15469905
## TechSup -0.12534905 -0.093310796 -0.132972485 0.78772961 -0.16591472
## CompRes 0.02030565 -0.022150933 -0.004487151 -0.10968621 0.28834438
## Advertising -0.13192319 0.262879343 -0.063298038 0.02759096 -0.03265804
## ProdLine 1.00000000 -0.230170570 -0.361757904 0.25718360 -0.28098068
## SalesFImage -0.23017057 1.000000000 0.126612489 0.18930027 -0.18235930
## ComPricing -0.36175790 0.126612489 1.000000000 0.02035155 -0.08650087
## WartyClaim 0.25718360 0.189300271 0.020351554 1.00000000 0.25984451
## OrdBilling -0.28098068 -0.182359301 -0.086500869 0.25984451 1.00000000
## DelSpeed 0.50189505 0.005889925 0.190437993 -0.09164900 0.35053021
## Satisfaction 0.18325156 0.660251850 -0.087467711 -0.08868071 0.14880055
## DelSpeed Satisfaction
## ProdQual -0.335084210 0.60743879
## Ecom -0.008391793 -0.33084408
## TechSup -0.021914916 0.05511187
## CompRes 0.528932328 0.17241635
## Advertising 0.204654406 -0.04497354
## ProdLine 0.501895051 0.18325156
## SalesFImage 0.005889925 0.66025185
## ComPricing 0.190437993 -0.08746771
## WartyClaim -0.091649002 -0.08868071
## OrdBilling 0.350530212 0.14880055
## DelSpeed 1.000000000 0.08955614
## Satisfaction 0.089556143 1.00000000
##
## $p.value
## ProdQual Ecom TechSup CompRes Advertising
## ProdQual 0.000000e+00 1.447435e-01 9.819081e-01 5.977987e-01 0.29163501
## Ecom 1.447435e-01 0.000000e+00 4.402823e-01 7.538375e-01 0.99778145
## TechSup 9.819081e-01 4.402823e-01 0.000000e+00 1.769171e-01 0.57876015
## CompRes 5.977987e-01 7.538375e-01 1.769171e-01 0.000000e+00 0.54395113
## Advertising 2.916350e-01 9.977814e-01 5.787601e-01 5.439511e-01 0.00000000
## ProdLine 7.269030e-03 1.476353e-01 2.391187e-01 8.493380e-01 0.21517368
## SalesFImage 2.576212e-04 2.442012e-16 3.817042e-01 8.358301e-01 0.01230672
## ComPricing 8.956443e-01 8.890480e-01 2.115149e-01 9.665194e-01 0.55338281
## WartyClaim 6.890839e-01 2.471182e-01 3.262868e-20 3.034147e-01 0.79629803
## OrdBilling 6.097697e-01 1.454288e-01 1.180864e-01 5.850710e-03 0.75993047
## DelSpeed 1.245192e-03 9.374303e-01 8.375553e-01 8.359069e-08 0.05300070
## Satisfaction 2.182238e-10 1.447603e-03 6.059096e-01 1.041593e-01 0.67382405
## ProdLine SalesFImage ComPricing WartyClaim OrdBilling
## ProdQual 7.269030e-03 2.576212e-04 0.8956443272 6.890839e-01 0.6097697424
## Ecom 1.476353e-01 2.442012e-16 0.8890479591 2.471182e-01 0.1454287628
## TechSup 2.391187e-01 3.817042e-01 0.2115148546 3.262868e-20 0.1180864174
## CompRes 8.493380e-01 8.358301e-01 0.9665194173 3.034147e-01 0.0058507099
## Advertising 2.151737e-01 1.230672e-02 0.5533828069 7.962980e-01 0.7599304715
## ProdLine 0.000000e+00 2.907563e-02 0.0004593443 1.440249e-02 0.0073046065
## SalesFImage 2.907563e-02 0.000000e+00 0.2343791374 7.394541e-02 0.0853804743
## ComPricing 4.593443e-04 2.343791e-01 0.0000000000 8.490015e-01 0.4175555780
## WartyClaim 1.440249e-02 7.394541e-02 0.8490014635 0.000000e+00 0.0133877361
## OrdBilling 7.304606e-03 8.538047e-02 0.4175555780 1.338774e-02 0.0000000000
## DelSpeed 4.664278e-07 9.560619e-01 0.0721942454 3.902770e-01 0.0007064114
## Satisfaction 8.383625e-02 1.449719e-12 0.4123500152 4.058729e-01 0.1615974575
## DelSpeed Satisfaction
## ProdQual 1.245192e-03 2.182238e-10
## Ecom 9.374303e-01 1.447603e-03
## TechSup 8.375553e-01 6.059096e-01
## CompRes 8.359069e-08 1.041593e-01
## Advertising 5.300070e-02 6.738241e-01
## ProdLine 4.664278e-07 8.383625e-02
## SalesFImage 9.560619e-01 1.449719e-12
## ComPricing 7.219425e-02 4.123500e-01
## WartyClaim 3.902770e-01 4.058729e-01
## OrdBilling 7.064114e-04 1.615975e-01
## DelSpeed 0.000000e+00 4.012356e-01
## Satisfaction 4.012356e-01 0.000000e+00
##
## $statistic
## ProdQual Ecom TechSup CompRes Advertising
## ProdQual 0.00000000 1.471424516 0.02274128 -0.52949015 1.060907458
## Ecom 1.47142452 0.000000000 0.77522957 -0.31456380 -0.002788456
## TechSup 0.02274128 0.775229571 0.00000000 1.36122814 -0.557266374
## CompRes -0.52949015 -0.314563798 1.36122814 0.00000000 -0.609215688
## Advertising 1.06090746 -0.002788456 -0.55726637 -0.60921569 0.000000000
## ProdLine 2.74821968 1.460787002 -1.18522655 0.19052316 -1.248460807
## SalesFImage -3.80921125 10.081854496 -0.87916864 -0.20784517 2.555921881
## ComPricing -0.13154519 0.139913642 -1.25856891 -0.04209363 -0.594981366
## WartyClaim -0.40142023 -1.165122048 11.99562484 -1.03519395 0.258924708
## OrdBilling 0.51223505 1.468888754 -1.57829305 2.82489237 -0.306523103
## DelSpeed -3.33624278 -0.078724768 -0.20562952 5.84663073 1.961341689
## Satisfaction 7.17336519 -3.288799958 0.51778207 1.64199901 -0.422316520
## ProdLine SalesFImage ComPricing WartyClaim OrdBilling
## ProdQual 2.7482197 -3.80921125 -0.13154519 -0.4014202 0.5122350
## Ecom 1.4607870 10.08185450 0.13991364 -1.1651220 1.4688888
## TechSup -1.1852266 -0.87916864 -1.25856891 11.9956248 -1.5782930
## CompRes 0.1905232 -0.20784517 -0.04209363 -1.0351939 2.8248924
## Advertising -1.2484608 2.55592188 -0.59498137 0.2589247 -0.3065231
## ProdLine 0.0000000 -2.21876450 -3.64012829 2.4965744 -2.7464786
## SalesFImage -2.2187645 0.00000000 1.19736653 1.8084929 -1.7398559
## ComPricing -3.6401283 1.19736653 0.00000000 0.1909540 -0.8145030
## WartyClaim 2.4965744 1.80849286 0.19095404 0.0000000 2.5242649
## OrdBilling -2.7464786 -1.73985585 -0.81450302 2.5242649 0.0000000
## DelSpeed 5.4434474 0.05525335 1.81976992 -0.8633775 3.5110350
## Satisfaction 1.7486638 8.24679932 -0.82367672 -0.8351894 1.4115877
## DelSpeed Satisfaction
## ProdQual -3.33624278 7.1733652
## Ecom -0.07872477 -3.2888000
## TechSup -0.20562952 0.5177821
## CompRes 5.84663073 1.6419990
## Advertising 1.96134169 -0.4223165
## ProdLine 5.44344736 1.7486638
## SalesFImage 0.05525335 8.2467993
## ComPricing 1.81976992 -0.8236767
## WartyClaim -0.86337748 -0.8351894
## OrdBilling 3.51103503 1.4115877
## DelSpeed 0.00000000 0.8435005
## Satisfaction 0.84350046 0.0000000
##
## $n
## [1] 100
##
## $gp
## [1] 10
##
## $method
## [1] "pearson"
model1 <- lm(Satisfaction ~ ., prodsurvey[,-1])
summary(model1)
##
## Call:
## lm(formula = Satisfaction ~ ., data = prodsurvey[, -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.43005 -0.31165 0.07621 0.37190 0.90120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.66961 0.81233 -0.824 0.41199
## ProdQual 0.37137 0.05177 7.173 2.18e-10 ***
## Ecom -0.44056 0.13396 -3.289 0.00145 **
## TechSup 0.03299 0.06372 0.518 0.60591
## CompRes 0.16703 0.10173 1.642 0.10416
## Advertising -0.02602 0.06161 -0.422 0.67382
## ProdLine 0.14034 0.08025 1.749 0.08384 .
## SalesFImage 0.80611 0.09775 8.247 1.45e-12 ***
## ComPricing -0.03853 0.04677 -0.824 0.41235
## WartyClaim -0.10298 0.12330 -0.835 0.40587
## OrdBilling 0.14635 0.10367 1.412 0.16160
## DelSpeed 0.16570 0.19644 0.844 0.40124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5623 on 88 degrees of freedom
## Multiple R-squared: 0.8021, Adjusted R-squared: 0.7774
## F-statistic: 32.43 on 11 and 88 DF, p-value: < 2.2e-16
ProdQual
-0.44 x Ecom
+ 0.034 x TechSup
+ 0.16 x CompRes
-0.02 x Advertising
+ 0.14 x ProdLine
+ 0.80 x SalesFImage
- 0.038 x CompPricing
- 0.10 x WartyClaim
+ 0.14 x OrdBilling
+ 0.16 x DelSpeed
vif(model1)
## ProdQual Ecom TechSup CompRes Advertising ProdLine
## 1.635797 2.756694 2.976796 4.730448 1.508933 3.488185
## SalesFImage ComPricing WartyClaim OrdBilling DelSpeed
## 3.439420 1.635000 3.198337 2.902999 6.516014
From the VIF values, we can infer that variables CompRes
and DelSpeed
are a cause of concern.
X <- prodsurvey[,-c(1,13)]
y <- prodsurvey[,13]
Let’s check the factorability of the variables in the dataset: perform the Kaiser-Meyer-Olkin (KMO) Test. > Kaiser-Meyer-Olkin (KMO) Test and/or Bartlett’s Test is a measure of how suited your data is for Factor Analysis. The test measures sampling adequacy for each variable in the model and for the complete model. The statistic is a measure of the proportion of variance among variables that might be common variance. The lower the proportion, the more suited your data is to Factor Analysis.
datamatrix2 <- cor(X)
KMO(r=datamatrix2)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = datamatrix2)
## Overall MSA = 0.65
## MSA for each item =
## ProdQual Ecom TechSup CompRes Advertising ProdLine
## 0.51 0.63 0.52 0.79 0.78 0.62
## SalesFImage ComPricing WartyClaim OrdBilling DelSpeed
## 0.62 0.75 0.51 0.76 0.67
print(cortest.bartlett(datamatrix2, nrow(prodsurvey[,-1])))
## $chisq
## [1] 619.2726
##
## $p.value
## [1] 1.79337e-96
##
## $df
## [1] 55
Hence Factor Analysis is considered as an appropriate technique for further analysis of the data.
Plotting the eigenvalues from our factor analysis (whether it’s based on principal axis or principal components extraction), a parallel analysis involves generating random correlation matrices and after factor analyzing them, comparing the resulting eigenvalues to the eigenvalues of the observed data. The idea behind this method is thatobserved eigenvalues that are higher than their corresponding random eigenvalues are more likely to be from “meaningful factors” than observed eigenvalues that are below their corresponding random eigenvalue.
parallel <- fa.parallel(X)
## Parallel analysis suggests that the number of factors = 3 and the number of components = 3
# other option of scree plot
#scree(X)
Selection of factors from the scree plot can be based on: 1. Kaiser-Guttman normalization rule says that we should choose all factors with an eigenvalue greater than 1. 2. Bend elbow rule
When looking at the parallel analysis scree plots, there are two places to look depending on which type of factor analysis you’re looking to run. The two blue lines show you the observed eigenvalues - they should look identical to the scree plots drawn by the scree function. The red dotted lines show the random eigenvalues or the simulated data line. Each point on the blue line that lies above the corresponding simulated data line is a factor or component to extract. In this analysis, you can see that 4 factors in the “Factor Analysis” parallel analysis lie above the corresponding simulated data line and 3 components in the “Principal Components” parallel analysis lie above the corresponding simulated data line. In our case, however, the last factor/component lies very close to the line - for both principal components extraction and principal axis extraction. Thus, we should probably compare the 3 factor and 4 factor solutions, to see which one is most interpretable.
ev <- eigen(cor(X))
ev$values
## [1] 3.42697133 2.55089671 1.69097648 1.08655606 0.60942409 0.55188378
## [7] 0.40151815 0.24695154 0.20355327 0.13284158 0.09842702
Factor <- rep(1:11)
Eigen_Values <- ev$values
Scree <- data.frame(Factor, Eigen_Values)
ggplot(data = Scree, mapping = aes(x = Factor, y = Eigen_Values)) +
geom_point() +
geom_line() +
scale_y_continuous(name = "Eigen Values", limits = c(0,4)) +
theme_bw() +
theme(panel.grid.major.y = element_line(color = "skyblue")) +
ggtitle("Scree Plot")
So as per the elbow or Kaiser-Guttman normalization rule, we are good to go ahead with 4 factors.
fa.none <- fa(r=X,
nfactors = 4,
# covar = FALSE, SMC = TRUE,
fm="pa", # type of factor analysis we want to use (“pa” is principal axis factoring)
max.iter=100, # (50 is the default, but we have changed it to 100
rotate="none") # none rotation
print(fa.none)
## Factor Analysis using method = pa
## Call: fa(r = X, nfactors = 4, rotate = "none", max.iter = 100, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 PA3 PA4 h2 u2 com
## ProdQual 0.20 -0.41 -0.06 0.46 0.42 0.576 2.4
## Ecom 0.29 0.66 0.27 0.22 0.64 0.362 2.0
## TechSup 0.28 -0.38 0.74 -0.17 0.79 0.205 1.9
## CompRes 0.86 0.01 -0.26 -0.18 0.84 0.157 1.3
## Advertising 0.29 0.46 0.08 0.13 0.31 0.686 1.9
## ProdLine 0.69 -0.45 -0.14 0.31 0.80 0.200 2.3
## SalesFImage 0.39 0.80 0.35 0.25 0.98 0.021 2.1
## ComPricing -0.23 0.55 -0.04 -0.29 0.44 0.557 1.9
## WartyClaim 0.38 -0.32 0.74 -0.15 0.81 0.186 2.0
## OrdBilling 0.75 0.02 -0.18 -0.18 0.62 0.378 1.2
## DelSpeed 0.90 0.10 -0.30 -0.20 0.94 0.058 1.4
##
## PA1 PA2 PA3 PA4
## SS loadings 3.21 2.22 1.50 0.68
## Proportion Var 0.29 0.20 0.14 0.06
## Cumulative Var 0.29 0.49 0.63 0.69
## Proportion Explained 0.42 0.29 0.20 0.09
## Cumulative Proportion 0.42 0.71 0.91 1.00
##
## Mean item complexity = 1.9
## Test of the hypothesis that 4 factors are sufficient.
##
## The degrees of freedom for the null model are 55 and the objective function was 6.55 with Chi Square of 619.27
## The degrees of freedom for the model are 17 and the objective function was 0.33
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.03
##
## The harmonic number of observations is 100 with the empirical chi square 3.19 with prob < 1
## The total number of observations was 100 with Likelihood Chi Square = 30.27 with prob < 0.024
##
## Tucker Lewis Index of factoring reliability = 0.921
## RMSEA index = 0.096 and the 90 % confidence intervals are 0.032 0.139
## BIC = -48.01
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2 PA3 PA4
## Correlation of (regression) scores with factors 0.98 0.97 0.95 0.88
## Multiple R square of scores with factors 0.96 0.95 0.91 0.78
## Minimum correlation of possible factor scores 0.92 0.90 0.82 0.56
factanal.none <- factanal(X, factors=4, scores = c("regression"), rotation = "none")
print(factanal.none)
##
## Call:
## factanal(x = X, factors = 4, scores = c("regression"), rotation = "none")
##
## Uniquenesses:
## ProdQual Ecom TechSup CompRes Advertising ProdLine
## 0.682 0.360 0.228 0.178 0.679 0.005
## SalesFImage ComPricing WartyClaim OrdBilling DelSpeed
## 0.017 0.636 0.163 0.347 0.076
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## ProdQual 0.467 -0.148 -0.229 -0.159
## Ecom 0.791
## TechSup 0.198 -0.482 0.707
## CompRes 0.589 0.316 0.535 0.300
## Advertising 0.556 0.110
## ProdLine 0.997
## SalesFImage 0.987
## ComPricing -0.491 0.252 0.238
## WartyClaim 0.279 0.124 -0.487 0.712
## OrdBilling 0.452 0.275 0.493 0.360
## DelSpeed 0.629 0.364 0.582 0.241
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 2.522 2.316 1.469 1.322
## Proportion Var 0.229 0.211 0.134 0.120
## Cumulative Var 0.229 0.440 0.573 0.694
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 24.26 on 17 degrees of freedom.
## The p-value is 0.113
factanal.var <- factanal(X, factors=4, scores = c("regression"), rotation = "varimax")
print(factanal.var)
##
## Call:
## factanal(x = X, factors = 4, scores = c("regression"), rotation = "varimax")
##
## Uniquenesses:
## ProdQual Ecom TechSup CompRes Advertising ProdLine
## 0.682 0.360 0.228 0.178 0.679 0.005
## SalesFImage ComPricing WartyClaim OrdBilling DelSpeed
## 0.017 0.636 0.163 0.347 0.076
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## ProdQual 0.557
## Ecom 0.793
## TechSup 0.872 0.102
## CompRes 0.884 0.142 0.135
## Advertising 0.190 0.521 -0.110
## ProdLine 0.502 0.104 0.856
## SalesFImage 0.119 0.974 -0.130
## ComPricing 0.225 -0.216 -0.514
## WartyClaim 0.894 0.158
## OrdBilling 0.794 0.101 0.105
## DelSpeed 0.928 0.189 0.164
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 2.592 1.977 1.638 1.423
## Proportion Var 0.236 0.180 0.149 0.129
## Cumulative Var 0.236 0.415 0.564 0.694
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 24.26 on 17 degrees of freedom.
## The p-value is 0.113
Factor analysis results are typically interpreted in terms of the major loadings on each factor. These structures may be represented an “interpretable” solution as a table of loadings or graphically, where all loadings with an absolute value \(\rightarrow\) some cut point are represented as an edge (path).
fa.diagram(fa.none)
fa.var <- fa(r=X,
nfactors = 4,
# covar = FALSE, SMC = TRUE,
fm="pa", # type of factor analysis we want to use (“pa” is principal axis factoring)
max.iter=100, # (50 is the default, but we have changed it to 100
rotate="varimax") # none rotation
print(fa.var)
## Factor Analysis using method = pa
## Call: fa(r = X, nfactors = 4, rotate = "varimax", max.iter = 100, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 PA3 PA4 h2 u2 com
## ProdQual 0.02 -0.07 0.02 0.65 0.42 0.576 1.0
## Ecom 0.07 0.79 0.03 -0.11 0.64 0.362 1.1
## TechSup 0.02 -0.03 0.88 0.12 0.79 0.205 1.0
## CompRes 0.90 0.13 0.05 0.13 0.84 0.157 1.1
## Advertising 0.17 0.53 -0.04 -0.06 0.31 0.686 1.2
## ProdLine 0.53 -0.04 0.13 0.71 0.80 0.200 1.9
## SalesFImage 0.12 0.97 0.06 -0.13 0.98 0.021 1.1
## ComPricing -0.08 0.21 -0.21 -0.59 0.44 0.557 1.6
## WartyClaim 0.10 0.06 0.89 0.13 0.81 0.186 1.1
## OrdBilling 0.77 0.13 0.09 0.09 0.62 0.378 1.1
## DelSpeed 0.95 0.19 0.00 0.09 0.94 0.058 1.1
##
## PA1 PA2 PA3 PA4
## SS loadings 2.63 1.97 1.64 1.37
## Proportion Var 0.24 0.18 0.15 0.12
## Cumulative Var 0.24 0.42 0.57 0.69
## Proportion Explained 0.35 0.26 0.22 0.18
## Cumulative Proportion 0.35 0.60 0.82 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 4 factors are sufficient.
##
## The degrees of freedom for the null model are 55 and the objective function was 6.55 with Chi Square of 619.27
## The degrees of freedom for the model are 17 and the objective function was 0.33
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.03
##
## The harmonic number of observations is 100 with the empirical chi square 3.19 with prob < 1
## The total number of observations was 100 with Likelihood Chi Square = 30.27 with prob < 0.024
##
## Tucker Lewis Index of factoring reliability = 0.921
## RMSEA index = 0.096 and the 90 % confidence intervals are 0.032 0.139
## BIC = -48.01
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2 PA3 PA4
## Correlation of (regression) scores with factors 0.98 0.99 0.94 0.88
## Multiple R square of scores with factors 0.96 0.97 0.88 0.78
## Minimum correlation of possible factor scores 0.93 0.94 0.77 0.55
fa.var$loadings
##
## Loadings:
## PA1 PA2 PA3 PA4
## ProdQual 0.647
## Ecom 0.787 -0.113
## TechSup 0.883 0.116
## CompRes 0.898 0.130 0.132
## Advertising 0.166 0.530
## ProdLine 0.525 0.127 0.712
## SalesFImage 0.115 0.971 -0.135
## ComPricing 0.213 -0.209 -0.590
## WartyClaim 0.103 0.885 0.128
## OrdBilling 0.768 0.127
## DelSpeed 0.949 0.185
##
## PA1 PA2 PA3 PA4
## SS loadings 2.635 1.967 1.641 1.371
## Proportion Var 0.240 0.179 0.149 0.125
## Cumulative Var 0.240 0.418 0.568 0.692
fa.diagram(fa.var)
Factors | Variables | Label | Short Interpretaation |
---|---|---|---|
PA1 | DelSpeed, CompRes, OrdBilling | Purchase | All the items are related to purchasing the product; from placing order to billing and getting it delivered. |
PA2 | SalesFImage, Ecom, Advertising | Marketing | In this factors the item are related to marketing processes like the image of sales force and spending on adverstising |
PA3 | WartyClaim, TechSup | Post Purchase | Post purchase activities are included in this factor; like warranty claims and technical support. |
PA4 | ProdLine, ProdQual, CompPricing | Product Position | Product positioning related items are grouped in this factor. |
head(fa.var$scores)
## PA1 PA2 PA3 PA4
## [1,] -0.1338871 0.9175166 -1.719604873 0.09135411
## [2,] 1.6297604 -2.0090053 -0.596361722 0.65808192
## [3,] 0.3637658 0.8361736 0.002979966 1.37548765
## [4,] -1.2225230 -0.5491336 1.245473305 -0.64421384
## [5,] -0.4854209 -0.4276223 -0.026980304 0.47360747
## [6,] -0.5950924 -1.3035333 -1.183019401 -0.95913571
#fa.var$scores # for 100 observations
regdata <- cbind(prodsurvey["Satisfaction"], fa.var$scores)
#Labeling the data
names(regdata) <- c("Satisfaction", "Purchase", "Marketing",
"Post_purchase", "Prod_positioning")
head(regdata)
## Satisfaction Purchase Marketing Post_purchase Prod_positioning
## 1 8.2 -0.1338871 0.9175166 -1.719604873 0.09135411
## 2 5.7 1.6297604 -2.0090053 -0.596361722 0.65808192
## 3 8.9 0.3637658 0.8361736 0.002979966 1.37548765
## 4 4.8 -1.2225230 -0.5491336 1.245473305 -0.64421384
## 5 7.1 -0.4854209 -0.4276223 -0.026980304 0.47360747
## 6 4.7 -0.5950924 -1.3035333 -1.183019401 -0.95913571
#Splitting the data 70:30
#Random number generator, set seed.
set.seed(100)
indices= sample(1:nrow(regdata), 0.7*nrow(regdata))
train=regdata[indices,]
test = regdata[-indices,]
model.fa.score = lm(Satisfaction~., train)
summary(model.fa.score)
##
## Call:
## lm(formula = Satisfaction ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76280 -0.48717 0.06799 0.46459 1.24022
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.91852 0.08068 85.757 < 2e-16 ***
## Purchase 0.50230 0.07641 6.574 9.74e-09 ***
## Marketing 0.75488 0.08390 8.998 5.00e-13 ***
## Post_purchase 0.08755 0.08216 1.066 0.291
## Prod_positioning 0.58074 0.08781 6.614 8.30e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6629 on 65 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7093
## F-statistic: 43.08 on 4 and 65 DF, p-value: < 2.2e-16
vif(model.fa.score)
## Purchase Marketing Post_purchase Prod_positioning
## 1.009648 1.008235 1.015126 1.024050
#Model Performance metrics:
pred_test <- predict(model.fa.score, newdata = test, type = "response")
pred_test
## 6 8 11 13 17 19 26 33
## 4.975008 5.908267 6.951629 8.677431 6.613838 6.963113 6.313513 6.141338
## 34 35 37 40 42 44 49 50
## 6.158993 7.415742 6.589746 6.858206 7.133989 8.533080 8.765145 8.078744
## 53 56 57 60 65 67 71 73
## 7.395438 7.468360 8.744402 6.276660 5.936570 6.650322 8.299545 7.685564
## 75 80 96 97 99 100
## 7.330191 6.719528 7.540233 6.143172 8.084583 5.799897
#Find MSE and MAPE scores:
#MSE/ MAPE of Model1
test$Satisfaction_Predicted <- pred_test
head(test[c("Satisfaction","Satisfaction_Predicted")], 10)
## Satisfaction Satisfaction_Predicted
## 6 4.7 4.975008
## 8 6.3 5.908267
## 11 7.4 6.951629
## 13 8.4 8.677431
## 17 6.4 6.613838
## 19 6.8 6.963113
## 26 6.6 6.313513
## 33 5.4 6.141338
## 34 7.3 6.158993
## 35 6.3 7.415742
test_r2 <- cor(test$Satisfaction, test$Satisfaction_Predicted) ^2
mse_test <- mse(test$Satisfaction, pred_test)
rmse_test <- sqrt(mse(test$Satisfaction, pred_test))
mape_test <- mape(test$Satisfaction, pred_test)
model_metrics <- cbind(mse_test,rmse_test,mape_test,test_r2)
print(model_metrics, 3)
## mse_test rmse_test mape_test test_r2
## [1,] 0.551 0.742 0.0875 0.554
As the feature Post_purchase
is not significant so we will drop this feature and then let’s run the regression model again.
##Regression model without post_purchase:
model2 <- lm(Satisfaction ~ Purchase+ Marketing+
Prod_positioning, data= train)
summary(model2)
##
## Call:
## lm(formula = Satisfaction ~ Purchase + Marketing + Prod_positioning,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7268 -0.5036 0.1117 0.4408 1.2962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.92158 0.08071 85.761 < 2e-16 ***
## Purchase 0.49779 0.07637 6.518 1.15e-08 ***
## Marketing 0.75870 0.08391 9.042 3.66e-13 ***
## Prod_positioning 0.59084 0.08739 6.761 4.30e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6636 on 66 degrees of freedom
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7087
## F-statistic: 56.95 on 3 and 66 DF, p-value: < 2.2e-16
pred_test2 <- predict(model2, newdata = test, type = "response")
pred_test2
## 6 8 11 13 17 19 26 33
## 5.069667 5.958765 7.001694 8.695630 6.521440 6.994018 6.249667 6.123357
## 34 35 37 40 42 44 49 50
## 6.132165 7.415882 6.566556 6.964570 7.233376 8.424720 8.853060 8.177009
## 53 56 57 60 65 67 71 73
## 7.413693 7.404078 8.746511 6.215816 5.893515 6.476465 8.227837 7.839941
## 75 80 96 97 99 100
## 7.354907 6.734279 7.559995 6.293774 8.083909 5.832094
test$Satisfaction_Predicted2 <- pred_test2
head(test[c(1,7)], 10)
## Satisfaction Satisfaction_Predicted2
## 6 4.7 5.069667
## 8 6.3 5.958765
## 11 7.4 7.001694
## 13 8.4 8.695630
## 17 6.4 6.521440
## 19 6.8 6.994018
## 26 6.6 6.249667
## 33 5.4 6.123357
## 34 7.3 6.132165
## 35 6.3 7.415882
test_r22 <- cor(test$Satisfaction, test$Satisfaction_Predicted2) ^2
mse_test2 <- mse(test$Satisfaction, pred_test2)
rmse_test2 <- sqrt(mse(test$Satisfaction, pred_test2))
mape_test2 <- mape(test$Satisfaction, pred_test2)
model2_metrics <- cbind(mse_test2,rmse_test2,mape_test2,test_r22)
model2_metrics
## mse_test2 rmse_test2 mape_test2 test_r22
## [1,] 0.5446039 0.7379728 0.08490197 0.5595099
Overall <- rbind(model_metrics,model2_metrics)
row.names(Overall) <- c("Test1", "Test2")
colnames(Overall) <- c("MSE", "RMSE", "MAPE", "R-squared")
print(Overall,3)
## MSE RMSE MAPE R-squared
## Test1 0.551 0.742 0.0875 0.554
## Test2 0.545 0.738 0.0849 0.560
Including Interaction model, we are able to make a better prediction!??
We saw how Factor Analysis can be used to reduce the dimensionality of a dataset and then we used multiple linear regression on the dimensionally reduced columns/Features for further analysis/predictions.