helped by 111078513

Question 1) Earlier, we examined a dataset from a security survey sent to customers of e-commerce websites. However, we only used the eigenvalue > 1 criterion and the screeplot “elbow” rule to find a suitable number of components. Let’s perform a parallel analysis as well this week:

library(readxl)
secQue <- read_excel("C:/R-language/BACS/security_questions.xlsx", sheet = "data")

1-a) Show a single visualization with scree plot of data, scree plot of simulated noise (use average eigenvalues of ≥ 100 noise samples), and a horizontal line showing the eigenvalue = 1 cutoff.

# PCA of the (standardized) security questions
sec_pca <- prcomp(secQue, scale. = TRUE)

# Eigenvalues of one random-noise dataset with n rows and p columns
sim_noise_ev <- function(n, p) {
  noise <- data.frame(replicate(p, rnorm(n)))
  eigen(cor(noise))$values
}

# Average the eigenvalues of 100 noise datasets with the same dimensions as secQue (405 rows, 18 items)
evalues_noise <- replicate(100, sim_noise_ev(405, 18))
evalues_mean <- apply(evalues_noise, 1, mean)

# Compare the data's eigenvalues to the averaged noise eigenvalues:
screeplot(sec_pca, type = "lines")
lines(evalues_mean, type = "b", col = "blue")
abline(h = 1, lty = "dotted")

1-b) How many dimensions would you retain if we used Parallel Analysis?

Since only 2 PCs have eigenvalues higher than the corresponding average “noise” eigenvalues, I would retain 2 dimensions when using parallel analysis.
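
A quick programmatic check of that count (a minimal sketch, reusing sec_pca and evalues_mean from 1-a):

sec_ev <- sec_pca$sdev^2       # eigenvalues of the standardized data
sum(sec_ev > evalues_mean)     # number of components above the average noise eigenvalue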

Question 2) Earlier, we treated the underlying dimensions of the security dataset as composites and examined their eigenvectors (weights). Now, let’s treat them as factors and examine factor loadings (use the principal() method from the psych package)

library(psych)
## Warning: package 'psych' was built under R version 4.2.3

2-a) Looking at the loadings of the first 3 principal components, to which components does each item seem to best belong?

sec_orig <- principal(secQue, nfactors = 3, rotate = "none", scores = TRUE); sec_orig
## Principal Components Analysis
## Call: principal(r = secQue, nfactors = 3, rotate = "none", scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
##      PC1   PC2   PC3   h2   u2 com
## Q1  0.82 -0.14  0.00 0.69 0.31 1.1
## Q2  0.67 -0.01  0.09 0.46 0.54 1.0
## Q3  0.77 -0.03  0.09 0.60 0.40 1.0
## Q4  0.62  0.64  0.11 0.81 0.19 2.1
## Q5  0.69 -0.03 -0.54 0.77 0.23 1.9
## Q6  0.68 -0.10  0.21 0.52 0.48 1.2
## Q7  0.66 -0.32  0.32 0.64 0.36 2.0
## Q8  0.79  0.04 -0.34 0.74 0.26 1.4
## Q9  0.72 -0.23  0.20 0.62 0.38 1.4
## Q10 0.69 -0.10 -0.53 0.76 0.24 1.9
## Q11 0.75 -0.26  0.17 0.66 0.34 1.4
## Q12 0.63  0.64  0.12 0.82 0.18 2.1
## Q13 0.71 -0.06  0.08 0.52 0.48 1.0
## Q14 0.81 -0.10  0.16 0.69 0.31 1.1
## Q15 0.70  0.01 -0.33 0.61 0.39 1.4
## Q16 0.76 -0.20  0.18 0.65 0.35 1.3
## Q17 0.62  0.66  0.11 0.83 0.17 2.0
## Q18 0.81 -0.11 -0.07 0.67 0.33 1.1
## 
##                        PC1  PC2  PC3
## SS loadings           9.31 1.60 1.15
## Proportion Var        0.52 0.09 0.06
## Cumulative Var        0.52 0.61 0.67
## Proportion Explained  0.77 0.13 0.10
## Cumulative Proportion 0.77 0.90 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.05 
##  with the empirical chi square  258.65  with prob <  1.4e-15 
## 
## Fit based upon off diagonal values = 0.99

According to the results, nearly all items load most strongly on PC1, so they seem to best belong to PC1. Items Q4, Q12, and Q17, however, load about equally on PC1 and PC2 (roughly 0.62–0.66 on both), so they could also belong to PC2.
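
One way to check this programmatically (a small sketch, assuming sec_orig from above) is to ask which component each item loads on most strongly in absolute value:

apply(abs(sec_orig$loadings[, 1:3]), 1, which.max)   # 1 = PC1, 2 = PC2, 3 = PC3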

2-b) How much of the total variance of the security dataset do the first 3 PCs capture?

sum(sec_orig$loadings[,1:3]^2)
## [1] 12.05684
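
The roughly 12.06 units of captured variance are out of 18 standardized items in total; expressed as a proportion (a quick check reusing sec_orig), this should match the Cumulative Var row reported above (0.67):

sum(sec_orig$loadings[, 1:3]^2) / ncol(secQue)   # ≈ 0.67 of the total variance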

2-c) Looking at commonality and uniqueness, which items are less than adequately explained by the first 3 principal components?

# communality (h2) = sum of squared loadings across the 3 PCs; flag items with h2 below 0.55
is_adeq_exp <- apply(sec_orig$loadings[, 1:3]^2, 1, sum) < 0.55
row.names(sec_orig$loadings)[is_adeq_exp]
## [1] "Q2"  "Q6"  "Q13"

If we take a communality (h2) below 0.55 to mean an item is less than adequately explained, then Q2, Q6, and Q13 are the items not adequately explained by the first 3 principal components.

2-d) How many measurement items share similar loadings between 2 or more components?

Three items share similar loadings between 2 or more components: Q4, Q12, and Q17 each load about 0.62–0.66 on both PC1 and PC2.
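
A rough programmatic check (a sketch that treats “similar” as meaning the two largest absolute loadings differ by less than 0.10, a cutoff chosen here only for illustration):

loads <- sec_orig$loadings[, 1:3]
shares_loadings <- apply(abs(loads), 1, function(x) {
  s <- sort(x, decreasing = TRUE)
  (s[1] - s[2]) < 0.10               # top two loadings are close together
})
row.names(loads)[shares_loadings]    # should flag Q4, Q12, Q17 given the table above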

2-e) Can you interpret a ‘meaning’ behind the first principal component from the items that load best upon it? (see the wording of the questions of those items)

The higher an item's loading on a component, the more strongly that item correlates with it. Q1, Q14, and Q18 all load above 0.8 on PC1, and their wording shares a similar intention: the user believes the platform keeps their transaction information secure. That shared idea is a reasonable ‘meaning’ for the first principal component.
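
To list those top items programmatically (a small sketch reusing sec_orig; the 0.8 cutoff is simply the one used in the sentence above):

row.names(sec_orig$loadings)[sec_orig$loadings[, 1] > 0.8]   # Q1, Q14, Q18 per the loadings table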

Question 3) To improve interpretability of loadings, let’s rotate our principal component axes using the varimax technique to get rotated components (extract and rotate only three principal components)

3-a) Individually, does each rotated component (RC) explain the same, or different, amount of variance than the corresponding principal components (PCs)?

sec_pca_rot <- principal(secQue, nfactors = 3, rotate = "varimax", scores = TRUE)
comparison_PC_RC <- t(data.frame(sec_orig$values[1:3], sec_pca_rot$values[1:3]))
row.names(comparison_PC_RC) <- c("PCs", "RCs"); comparison_PC_RC
##         [,1]     [,2]     [,3]
## PCs 9.310953 1.596332 1.149558
## RCs 9.310953 1.596332 1.149558

According to this comparison, each rotated component appears to explain the same amount of variance as its corresponding principal component.
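
One caveat: `$values` from principal() holds the eigenvalues of the correlation matrix, which do not change under rotation, so the two rows above are identical by construction. To compare the variance attributable to the rotated components themselves, one could instead compare the SS loadings (a sketch, output not shown; varimax typically redistributes variance among the retained components while leaving the total unchanged):

colSums(sec_orig$loadings[, 1:3]^2)      # SS loadings of the unrotated PCs (9.31, 1.60, 1.15 above)
colSums(sec_pca_rot$loadings[, 1:3]^2)   # SS loadings of the rotated RCs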

3-b) Together, do the three rotated components explain the same, more, or less cumulative variance as the three principal components combined?

apply(comparison_PC_RC,1,sum)
##      PCs      RCs 
## 12.05684 12.05684

As we can see, the three rotated components explain the same cumulative variance as the three principal components combined.

3-c) Looking back at the items that shared similar loadings with multiple principal components (#2d), do those items have more clearly differentiated loadings among rotated components?

# columns 1 and 3 of the Structure matrix are RC1 and RC2 (the columns print in the order RC1, RC3, RC2)
sec_pca_rot$Structure[c(4, 12, 17), c(1, 3)]
##           RC1       RC2
## Q4  0.2182880 0.8536838
## Q12 0.2327616 0.8542346
## Q17 0.2054021 0.8703910

Yes. After rotation these items load high on RC2 (≈ 0.85) and low on RC1 (≈ 0.21–0.23), so their loadings are now clearly differentiated among the rotated components.
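
For a side-by-side view (a sketch assuming the objects defined above; output not shown), the unrotated loadings and the rotated structure coefficients of these items can be printed next to each other:

cbind(sec_orig$loadings[c(4, 12, 17), 1:2],          # PC1, PC2 before rotation
      sec_pca_rot$Structure[c(4, 12, 17), c(1, 3)])  # RC1, RC2 after rotation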

3-d) Can you now more easily interpret the “meaning” of the 3 rotated components from the items that load best upon each of them? (see the wording of the questions of those items)

# reorder the Structure columns so they read RC1, RC2, RC3
sec_pca_rot_struc <- sec_pca_rot$Structure[, c(1, 3, 2)]
sec_pca_rot_struc_col <- colnames(sec_pca_rot_struc)
sec_pca_rot_struc_row <- row.names(sec_pca_rot_struc)

sec_pca_rot_struc_col[1]
## [1] "RC1"
is_adeq_exp_sec <- sec_pca_rot_struc[,1] > 0.5
sec_pca_rot_struc_row[is_adeq_exp_sec]
##  [1] "Q1"  "Q2"  "Q3"  "Q6"  "Q7"  "Q9"  "Q11" "Q13" "Q14" "Q16" "Q18"
sec_pca_rot_struc_col[2]
## [1] "RC2"
is_adeq_exp_sec <- sec_pca_rot_struc[,2] > 0.5
sec_pca_rot_struc_row[is_adeq_exp_sec]
## [1] "Q4"  "Q12" "Q17"
sec_pca_rot_struc_col[3]
## [1] "RC3"
is_adeq_exp_sec <- sec_pca_rot_struc[,3] > 0.5
sec_pca_rot_struc_row[is_adeq_exp_sec]
## [1] "Q5"  "Q8"  "Q10" "Q15"

Now the interpretation is easier: each item can be assigned to the rotated component it loads on most strongly. RC1 covers most of the general security items (Q1, Q2, Q3, Q6, Q7, Q9, Q11, Q13, Q14, Q16, Q18), RC2 covers Q4, Q12, and Q17, and RC3 covers Q5, Q8, Q10, and Q15.

3-e) If we reduced the number of extracted and rotated components to 2, does the meaning of our rotated components change?

sec_pca_rot2 <- principal(secQue, nfactors = 2, rotate = "varimax", scores = TRUE)
sec_pca_rot_struc2 <- sec_pca_rot2$Structure[, c(1, 2)]
sec_pca_rot_struc_col2 <- colnames(sec_pca_rot_struc2)
sec_pca_rot_struc_row2 <- row.names(sec_pca_rot_struc2)

sec_pca_rot_struc_col2[1]
## [1] "RC1"
is_adeq_exp_sec2 <- sec_pca_rot_struc2[,1] > 0.5
sec_pca_rot_struc_row2[is_adeq_exp_sec2]
##  [1] "Q1"  "Q2"  "Q3"  "Q5"  "Q6"  "Q7"  "Q8"  "Q9"  "Q10" "Q11" "Q13" "Q14"
## [13] "Q15" "Q16" "Q18"
sec_pca_rot_struc_col2[2]
## [1] "RC2"
is_adeq_exp_sec2 <- sec_pca_rot_struc2[,2] > 0.5
sec_pca_rot_struc_row2[is_adeq_exp_sec2]
## [1] "Q4"  "Q12" "Q17"

If we reduce the number of components to 2, the items that originally belonged to RC3 merge into RC1 to form a broader new RC1, while the new RC2 still captures and explains Q4, Q12, and Q17. So the meaning of RC2 stays the same, but the meaning of RC1 becomes more general.
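
A small sanity check of that claim (a sketch reusing the objects defined above; the expected result follows from the item lists printed in 3-d and 3-e):

rc1_items_2 <- sec_pca_rot_struc_row2[sec_pca_rot_struc2[, 1] > 0.5]   # RC1 items, 2-component solution
rc1_items_3 <- sec_pca_rot_struc_row[sec_pca_rot_struc[, 1] > 0.5]     # RC1 items, 3-component solution
rc3_items_3 <- sec_pca_rot_struc_row[sec_pca_rot_struc[, 3] > 0.5]     # RC3 items, 3-component solution
setequal(rc1_items_2, union(rc1_items_3, rc3_items_3))                 # expected TRUE given the lists above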