PCA is used in exploratory data analysis and for building predictive models. It is commonly used for dimensionality reduction while minimizing information loss. The principal components are a collection of vectors in which vector \(i\) is orthogonal to the first \(i-1\) vectors. In PCA, we do not need to split the variables into independent and dependent variables.
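As a quick illustration of that orthogonality, here is a minimal sketch using base R's prcomp on the built-in USArrests data (not part of the original example):

```r
# Fit PCA on standardized variables
pc <- prcomp(USArrests, scale. = TRUE)
# The loading vectors are orthonormal, so their cross-product is the identity
round(crossprod(pc$rotation), 10)
```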
We would like our data to come from a multivariate normal distribution with zero covariances. Bartlett's test checks whether the observed correlation matrix \(R\) diverges significantly from the identity matrix. \(H_0\): the variables are orthogonal (the correlation matrix does NOT diverge from the identity matrix). If the \(p\)-value is greater than 0.05, we fail to reject \(H_0\) and no dimension reduction is needed.
The example from “data” shows a \(p\)-value of 0.15, which means the variables are already (approximately) linearly independent, so we do not need to reduce the dimension.
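The chunk that produced the output below is not shown; it was presumably a call to cortest.bartlett from the psych package (df = 3 matches the three correlation pairs of a three-variable dataset). A minimal sketch, assuming the dataset is named data:

```r
library(psych)  # provides cortest.bartlett()
# Bartlett's sphericity test on the correlation matrix of 'data' (name assumed)
cortest.bartlett(cor(data), n = nrow(data))
```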
## $chisq
## [1] 5.252329
##
## $p.value
## [1] 0.1542258
##
## $df
## [1] 3
The KMO index checks whether the sample is large enough to factorize the original variables efficiently. Ideally there should be 150+ observations, with at least five rows per variable (a stricter rule of thumb: 20 observations per variable).
KMO levels:

- 0.00 to 0.49: unacceptable
- 0.50 to 0.59: miserable
- 0.60 to 0.69: mediocre
- 0.70 to 0.79: middling
- 0.80 to 0.89: meritorious
- 0.90 to 1.00: marvelous
The example below shows a Bartlett test \(p\)-value \(< 0.05\), which means we should apply PCA. This is consistent with the data structure: x4 and x1, and x5 and x2, are highly correlated by construction. The KMO is about 0.53, which means we would need more data for a better analysis.
library(psych)  # for cortest.bartlett() and KMO()
set.seed(0)
n <- 50
data1 <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n)
)
# x4 and x5 are noisy copies of x1 and x2, so the pairs are highly correlated
data1$x4 <- data1$x1 + runif(n, min = -0.5, max = 0.5)
data1$x5 <- data1$x2 + runif(n, min = -0.5, max = 0.5)
# Test the correlation structure
r <- cor(data1)
cortest.bartlett(r, n = nrow(data1))
## $chisq
## [1] 220.7146
##
## $p.value
## [1] 7.573688e-42
##
## $df
## [1] 10
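The overall measure of sampling adequacy printed below was presumably obtained with psych::KMO; a sketch:

```r
KMO(r)$MSA  # overall Kaiser-Meyer-Olkin measure of sampling adequacy
```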
## [1] 0.52928
\(\det(R) > 0\): the variables are linearly independent (a determinant of exactly zero would mean an exact linear dependence among them).
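As a quick check (not in the original), the determinant of the correlation matrix above is positive but small, reflecting the near-linear dependence of x4 on x1 and x5 on x2:

```r
det(r)  # positive but close to 0: near-collinear columns
```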
From the cumulative proportions of variance explained below, the first three components capture about 98% of the variance, so three vectors are good enough for our predictions:
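The cumulative proportions below match what summary() reports for a prcomp fit; a sketch, assuming standardized PCA on data1:

```r
pc1 <- prcomp(data1, center = TRUE, scale. = TRUE)
# Cumulative proportion of variance explained by the first k components
summary(pc1)$importance["Cumulative Proportion", ]
```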
## PC1 PC2 PC3 PC4 PC5
## 0.47783 0.81072 0.97964 0.99027 1.00000
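The message below is the characteristic output of psych::fa.parallel; a sketch of the presumed call:

```r
fa.parallel(data1, fa = "pc")  # scree plot plus parallel analysis for components
```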
## Parallel analysis suggests that the number of factors = NA and the number of components = 0
From the loading coefficients, we can see that x1 and x4, and x2 and x5, are closely related:
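The three loading vectors printed below (one per retained component) could have been produced with, for example, psych::principal; a sketch assuming three unrotated components:

```r
p3 <- principal(data1, nfactors = 3, rotate = "none")
p3$loadings  # loadings of each variable on the first three components
```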
## x1 x2 x3 x4 x5
## 0.80973 0.66308 0.37330 0.80689 0.70949
## x1 x2 x3 x4 x5
## -0.50919 0.73101 -0.36921 -0.51877 0.68218
## x1 x2 x3 x4 x5
## -0.244616 0.024876 0.851002 -0.233529 0.073751
As another example, consider a job-satisfaction questionnaire with the following items:

q <- data.frame(
  var = c("x1", "x2", "x3", "x4", "x5", "x6",
          "x7", "x8", "x9", "x10", "x11", "x12"),
  desc = c("My job pays me well.",
           "I have my career well planned out.",
           "I would do anything to win my boss’ approval.",
           "This is the best job I have ever had.",
           "I find my work tedious.",
           "My job provides me with a sense of achievement.",
           "I perform well in competitive situations.",
           "I think it’s unfair to promote a person simply because he is more senior.",
           "I am happy with my job.",
           "I hate to be in a responsible position with several people reporting to me.",
           "I am quite content with what I have achieved with my job.",
           "I would leave my job for another offer that pays better.")
)
q
We can see that x1 and x12 are similar questions; next time we could collect responses and use a model to test this!
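A minimal sketch of how that test might look, using simulated (not real) responses in which x1 and x12 are driven by the same latent pay-satisfaction score; all names and numbers here are illustrative assumptions:

```r
library(psych)
set.seed(1)
n <- 200
pay <- rnorm(n)  # hypothetical latent "pay satisfaction" score
resp <- data.frame(
  x1  = pay + rnorm(n, sd = 0.4),   # "My job pays me well."
  x12 = -pay + rnorm(n, sd = 0.4),  # reverse-keyed: would leave for better pay
  x7  = rnorm(n)                    # unrelated item for contrast
)
cortest.bartlett(cor(resp), n = n)      # expect p < 0.05: correlations to exploit
principal(resp, nfactors = 1)$loadings  # x1 and x12 load on the same component
```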
ref: Mark Newman’s class material