1 PCA: A Brief Review of Principal Component Analysis

PCA is used in exploratory data analysis and in building predictive models. It is commonly used for dimensionality reduction while minimizing information loss. The principal components are a collection of vectors in which vector \(i\) is orthogonal to the first \(i-1\) vectors. In PCA, we do not need to split the variables into independent and dependent variables.
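
As a quick illustration of the orthogonality property, the sketch below (hypothetical random data, base R only) runs PCA and checks that the component directions are mutually orthogonal:

set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)   # hypothetical 100 x 5 data matrix
pca <- prcomp(X, scale. = TRUE)
round(crossprod(pca$rotation), 10)      # ~ identity matrix: components are orthogonal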

1.2 PCA Assumptions


1.2.1 Bartlett’s Test of Sphericity: Equal Variances

We would like our data to come from a multivariate normal distribution with zero covariances. Bartlett’s test checks whether the observed correlation matrix \(R\) diverges significantly from the identity matrix. \(H_0\): the variables are orthogonal (the correlation matrix does NOT diverge from the identity matrix). If the \(p\text{-value} > 0.05\), we fail to reject \(H_0\), meaning no dimensionality reduction is needed.

The example from “data” shows \(p\text{-value} = 0.15\), which means the variables in the dataset are already (approximately) linearly independent, and we don’t need to reduce the dimension.
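
A minimal sketch of a call that produces the output below, assuming the psych package and that the dataset is stored in a data frame named data:

library(psych)
# Bartlett's test compares the observed correlation matrix with the identity matrix
cortest.bartlett(cor(data), n = nrow(data))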

## $chisq
## [1] 5.252329
## 
## $p.value
## [1] 0.1542258
## 
## $df
## [1] 3

1.2.2 Sample Adequacy: KMO Index

The KMO (Kaiser-Meyer-Olkin) index checks whether the sample is large enough to factorize the original variables efficiently. Ideally, there should be 150+ observations, with a ratio of at least five rows for each variable.
* General rule: 20 observations per variable.

KMO levels:
* 0.00 to 0.49: unacceptable
* 0.50 to 0.59: miserable
* 0.60 to 0.69: mediocre
* 0.70 to 0.79: middling
* 0.80 to 0.89: meritorious
* 0.90 to 1.00: marvelous

The example below shows a Bartlett test \(p\text{-value} < 0.05\), which means we should apply PCA. This is consistent with the data structure: x4 and x1, and x5 and x2, are highly correlated. The KMO = 0.53, which falls in the “miserable” range, meaning we need more data for a better analysis.
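
A hedged sketch of calls that produce the output below, assuming the psych package and a second example data frame named data2 with columns x1 through x5:

library(psych)
R2 <- cor(data2)
cortest.bartlett(R2, n = nrow(data2))  # prints $chisq, $p.value, $df
KMO(R2)$MSA                            # overall Kaiser-Meyer-Olkin index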

## $chisq
## [1] 220.7146
## 
## $p.value
## [1] 7.573688e-42
## 
## $df
## [1] 10
## [1] 0.52928

1.2.3 Positive Determinant

\[\det(A) > 0\]
Since a correlation matrix is positive semi-definite, its determinant is zero exactly when the variables are linearly dependent; a positive determinant therefore indicates that the variables are linearly independent.
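
In R this is a one-line check; a minimal sketch, assuming the example data frame is named data2:

# det(cor(data2)) is 0 when the variables are linearly dependent,
# and strictly positive when the correlation matrix has full rank
det(cor(data2)) > 0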

1.3 PCA Procedure

From the cumulative proportion of variance below, we can see that three components are good enough for our predictions: the first three principal components already explain about 98% of the variance.
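
A hedged sketch of a call that yields the cumulative proportions printed below, assuming the data2 data frame:

pca <- prcomp(data2, scale. = TRUE)                 # standardize, then rotate
summary(pca)$importance["Cumulative Proportion", ]  # variance explained, cumulative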

##     PC1     PC2     PC3     PC4     PC5 
## 0.47783 0.81072 0.97964 0.99027 1.00000

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  0
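
The line above is the printed conclusion of a parallel analysis (which also draws a scree plot). A hedged sketch of such a call, assuming the psych package and the data2 data frame:

library(psych)
# Compares observed eigenvalues with eigenvalues of random data
# to suggest how many components/factors to retain
fa.parallel(data2)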

From the loading coefficients, we can see that x1 and x4, and x2 and x5, are closely related.

##      x1      x2      x3      x4      x5 
## 0.80973 0.66308 0.37330 0.80689 0.70949
##       x1       x2       x3       x4       x5 
## -0.50919  0.73101 -0.36921 -0.51877  0.68218
##        x1        x2        x3        x4        x5 
## -0.244616  0.024876  0.851002 -0.233529  0.073751
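
A hedged sketch of one way to obtain loading coefficients like the three vectors above, assuming the psych package and data2 (unrotated components):

library(psych)
pc <- principal(data2, nfactors = 3, rotate = "none")
pc$loadings  # each column: correlations between the variables and one component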

1.4 PCA Survey Example

q <- data.frame(
  var = c("x1", "x2", "x3", "x4", "x5", "x6",
          "x7", "x8", "x9", "x10", "x11", "x12"),
  desc = c("My job pays me well.",
           "I have my career well planned out.",
           "I would do anything to win my boss’ approval.",
           "This is the best job I have ever had.",
           "I find my work tedious.",
           "My job provides me with a sense of achievement.",
           "I perform well in competitive situations.",
           "I think it’s unfair to promote a person simply because he is more senior.",
           "I am happy with my job.",
           "I hate to be in a responsible position with several people reporting to me.",
           "I am quite content with what I have achieved with my job.",
           "I would leave my job for another offer that pays better.")
)
q

We can see that x1 and x12 are similar questions; next time we can use a model to test this!

ref: Mark Newman’s class material