PCA and EFA have these assumptions in common:
* Measurement scale is interval or ratio level
* Random sample - at least 5 observations per observed variable and at least 100 observations overall
* Larger sample sizes (10-20 observations per observed variable) are recommended for more stable estimates
* Oversample to compensate for missing values
* Linear relationship between observed variables
* Normal distribution for each observed variable
* Each pair of observed variables has a bivariate normal distribution
* PCA and EFA are both variable reduction techniques; if communalities are large (close to 1.00), their results can be similar.
PCA assumes the absence of outliers in the data. EFA assumes a multivariate normal distribution when the Maximum Likelihood extraction method is used.
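The multivariate-normality assumption can be checked with Mardia's test; here is a minimal sketch using the psych package and R's built-in iris measurements (the analysis data are loaded only in the next section).
# Illustration only: Mardia's test of multivariate skewness and kurtosis
library(psych)
mardia(datasets::iris[, 1:4], plot = FALSE)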
Principal Component Analysis (PCA)
PCA decomposes a correlation matrix with ones on the diagonal. The total amount of variance equals the trace of the matrix (the sum of the diagonal), which is the number of observed variables in the analysis. PCA minimizes the sum of the squared perpendicular distances to the component axis. Components are not interpreted as underlying constructs.
Principal components retained account for a maximal amount of variance. The component score is a linear combination of observed variables weighted by eigenvectors.
Exploratory Factor Analysis (EFA)
EFA decomposes an adjusted correlation matrix in which the diagonal has been adjusted for the unique factors. The amount of variance explained equals the trace of the matrix, i.e., the sum of the adjusted diagonal elements (the communalities).
Factors account for common variance in a data set. Squared multiple correlations (SMC) are used as communality estimates on the diagonals. Observed variables are a linear combination of the underlying and unique factors.
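As a small illustration of these communality estimates, the squared multiple correlations can be obtained from a correlation matrix with psych::smc(); the sketch below again uses R's built-in iris data rather than the data analysed below.
# Illustration only: squared multiple correlations (SMCs), the initial
# communality estimates placed on the diagonal of the adjusted matrix
library(psych)
smc(cor(datasets::iris[, 1:4]))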
Load data
library(readr)
iris <- read_csv("D:/Analytics/BACP-Dec2017/08_PCA_and_FA/iris.csv")
## Parsed with column specification:
## cols(
## sepal_len = col_double(),
## sepal_wid = col_double(),
## petal_len = col_double(),
## petal_wid = col_double(),
## class = col_character()
## )
Check if the data is numerical or categorical
str(iris)
## Classes 'tbl_df', 'tbl' and 'data.frame': 150 obs. of 5 variables:
## $ sepal_len: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ sepal_wid: num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ petal_len: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ petal_wid: num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ class : chr "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 5
## .. ..$ sepal_len: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ sepal_wid: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ petal_len: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ petal_wid: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ class : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
Missing value treatment
any(is.na(iris))
## [1] FALSE
Check the normality assumption
shapiro.test(iris$sepal_len)
##
## Shapiro-Wilk normality test
##
## data: iris$sepal_len
## W = 0.97609, p-value = 0.01018
shapiro.test(iris$sepal_wid)
##
## Shapiro-Wilk normality test
##
## data: iris$sepal_wid
## W = 0.98379, p-value = 0.07518
shapiro.test(iris$petal_len)
##
## Shapiro-Wilk normality test
##
## data: iris$petal_len
## W = 0.87642, p-value = 7.545e-10
shapiro.test(iris$petal_wid)
##
## Shapiro-Wilk normality test
##
## data: iris$petal_wid
## W = 0.90262, p-value = 1.865e-08
From the output, a p-value > 0.05 implies that the distribution of the data is not significantly different from a normal distribution; in other words, we can assume normality. In this case only sepal_wid meets this criterion.
Histogram
par(mfrow = c(2, 2))
hist(iris$sepal_len)
hist(iris$sepal_wid)
hist(iris$petal_len)
hist(iris$petal_wid)
Standardize the variables prior to applying PCA. Since skewness and the magnitude of the variables influence the resulting PCs, it is good practice to apply a skewness transformation and to center and scale the variables before running PCA.
Here we apply a log transformation to the variables, but we could have been more general and applied a Box-Cox transformation.
Applying log transformation
iris_pca<-log(iris[,1:4])
iris_class<-(iris[,5])
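A Box-Cox version of the same preprocessing could look like the sketch below; it relies on the caret package, which is not used elsewhere in this note, so treat it as an alternative sketch rather than part of the analysis.
# Illustration only: estimate Box-Cox transformations, then center and scale,
# instead of the simple log transform above
library(caret)
pp <- preProcess(iris[, 1:4], method = c("BoxCox", "center", "scale"))
iris_bc <- predict(pp, iris[, 1:4])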
Pairwise scatterplots
pairs(iris_pca)
Bartlett's Sphericity Test: to check whether data reduction is possible.
The null hypothesis is that the correlation matrix is an identity matrix, i.e., that dimension reduction is not possible.
If the p-value is less than 0.05, dimension reduction is possible.
library(psych)
## Warning: package 'psych' was built under R version 3.4.3
print(cortest.bartlett(cor(iris_pca), nrow(iris_pca)))
## $chisq
## [1] 669.2255
##
## $p.value
## [1] 2.692699e-141
##
## $df
## [1] 6
The function prcomp() uses singular value decomposition (SVD examines the covariances / correlations between individuals).
Arguments for prcomp():
* x: a numeric matrix or data frame
* scale: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place.
Implement PCA
iris_comp<-prcomp(iris_pca,scale = T)
print(iris_comp)
## Standard deviations (1, .., p=4):
## [1] 1.7087152 0.9563817 0.3700768 0.1693209
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## sepal_len 0.5059018 -0.44704259 0.7084146 0.2058278
## sepal_wid -0.2959102 -0.89323900 -0.3225722 -0.1025106
## petal_len 0.5781192 -0.02821947 -0.2010441 -0.7902930
## petal_wid 0.5676959 -0.03847958 -0.5947077 0.5679467
The output returns the standard deviation of each of the four PCs, and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables.
Plotting the resultant PCs
biplot(iris_comp, scale = 0)
Identify PCs that explain maximum variance
plot(iris_comp,type="lines")
The plot method returns a plot of the variances (y-axis) associated with the PCs (x-axis). The plot is useful to decide how many PCs to retain for further analysis.
In this simple case with only 4 PCs this is not a hard task and we can see that the first two PCs explain most of the variability in the data.
summary(iris_comp)
## Importance of components%s:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7087 0.9564 0.37008 0.16932
## Proportion of Variance 0.7299 0.2287 0.03424 0.00717
## Cumulative Proportion 0.7299 0.9586 0.99283 1.00000
The summary method describes the importance of the PCs.
* The first row describes the standard deviation associated with each PC.
* The second row shows the proportion of the variance in the data explained by each component.
* The third row describes the cumulative proportion of explained variance.
We can see that the first two PCs account for more than 95% of the variance in the data.
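These figures can be recomputed directly from the standard deviations returned by prcomp(), as in this short sketch.
# Proportion of variance explained, derived from the component standard deviations
pve <- iris_comp$sdev^2 / sum(iris_comp$sdev^2)
round(pve, 4)
round(cumsum(pve), 4)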
Visualize eigenvalues (scree plot)
Show the percentage of variances explained by each principal component
#install.packages("factoextra")
library(factoextra)
## Warning: package 'factoextra' was built under R version 3.4.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.3
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_eig(iris_comp)
Graph of individuals
Individuals with a similar profile are grouped together
fviz_pca_ind(iris_comp,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Graph of variables
Positively correlated variables point to the same side of the plot; negatively correlated variables point to opposite sides of the graph.
fviz_pca_var(iris_comp,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Biplot of individuals and variables
fviz_pca_biplot(iris_comp, repel = TRUE,
col.var = "#2E9FDF", # Variables color
col.ind = "#696969" # Individuals color
)
Use the predict function to compute the PCs for new data. Just for illustration, pretend the last two rows of the iris data have just arrived and we want to see their PC values.
#Last two row of transformed iris data
tail(iris_pca,2)
## # A tibble: 2 x 4
## sepal_len sepal_wid petal_len petal_wid
## <dbl> <dbl> <dbl> <dbl>
## 1 1.82 1.22 1.69 0.833
## 2 1.77 1.10 1.63 0.588
#Loading values of the PCs
print(iris_comp)
## Standard deviations (1, .., p=4):
## [1] 1.7087152 0.9563817 0.3700768 0.1693209
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## sepal_len 0.5059018 -0.44704259 0.7084146 0.2058278
## sepal_wid -0.2959102 -0.89323900 -0.3225722 -0.1025106
## petal_len 0.5781192 -0.02821947 -0.2010441 -0.7902930
## petal_wid 0.5676959 -0.03847958 -0.5947077 0.5679467
#PCs of last two rows of iris data
PC1<-predict(iris_comp, newdata=tail(iris_pca, 2))
PC1
## PC1 PC2 PC3 PC4
## [1,] 1.0831857 -1.01892520 -0.6989284 -0.09009126
## [2,] 0.9692526 -0.06419764 -0.4976021 -0.13605182
Singular value decomposition can be thought of as a method that transforms correlated variables into a set of uncorrelated variables, enabling one to better analyze the relationships of the original data.
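A small sketch of this connection: the right singular vectors of the centered and scaled data should match the rotation matrix returned by prcomp() (up to sign).
# SVD of the standardized data; compare sv$v with iris_comp$rotation
sv <- svd(scale(iris_pca))
round(sv$v, 4)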
Eigenvalues (the variance explained by each component) and eigenvectors (the directions along which that variance is explained)
cor_iris<-cor(iris_pca)
cor_iris
## sepal_len sepal_wid petal_len petal_wid
## sepal_len 1.0000000 -0.1037456 0.8413009 0.7999209
## sepal_wid -0.1037456 1.0000000 -0.4652183 -0.4344305
## petal_len 0.8413009 -0.4652183 1.0000000 0.9627361
## petal_wid 0.7999209 -0.4344305 0.9627361 1.0000000
eigen_iris<-eigen(cor_iris)
Eigenvalues
eigen_iris$values
## [1] 2.91970753 0.91466605 0.13695687 0.02866955
Proportion of variance explained
iris.var.prop<-eigen_iris$values/sum(eigen_iris$values)
iris.var.prop
## [1] 0.729926882 0.228666513 0.034239217 0.007167388
Variance explained by the 1st component: 72.9%
Variance explained by the 2nd component: 22.8%
Variance explained by the 3rd component: 03.4%
Variance explained by the 4th component: 00.7%
Eigenvectors
eigen_iris$vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.5059018 -0.44704259 0.7084146 0.2058278
## [2,] -0.2959102 -0.89323900 -0.3225722 -0.1025106
## [3,] 0.5781192 -0.02821947 -0.2010441 -0.7902930
## [4,] 0.5676959 -0.03847958 -0.5947077 0.5679467
The output returns the eigenvectors (the loadings), which are the coefficients of the linear combinations of the continuous variables.
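As a quick check, the component scores can be reproduced by hand: the standardized data multiplied by the eigenvectors should match the scores stored in iris_comp$x (up to sign).
# Manual component scores: standardized data times eigenvectors
manual_scores <- as.matrix(scale(iris_pca)) %*% eigen_iris$vectors
head(round(manual_scores, 4))
head(round(iris_comp$x, 4))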
How many components to select?
A scree plot is used to help us choose the number of components to retain, considering the variability in the data that is explained.
plot(eigen_iris$values, xlab = "Principal Component", ylab = "Eigen Values", type = "b")
plot(cumsum(iris.var.prop), xlab = "Principal Component", ylab = "Cumulative Eigen Values", type = "b")
The function princomp() uses the spectral decomposition approach (spectral decomposition examines the covariances / correlations between variables).
Component loadings
Percentage of the variance in a variable that is explained by a component.
iris_comp2<-princomp(iris[,1:4],scores=T)
iris_comp2$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## sepal_len 0.362 -0.657 0.581 0.317
## sepal_wid -0.730 -0.596 -0.324
## petal_len 0.857 0.176 -0.480
## petal_wid 0.359 -0.549 0.751
##
## Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings 1.00 1.00 1.00 1.00
## Proportion Var 0.25 0.25 0.25 0.25
## Cumulative Var 0.25 0.50 0.75 1.00
Most of the variance in component 1 is explained by sepal_len, petal_len and petal_wid.
sepal_len loads on all four components.
Scores: obtain the transformed data
iris_comp2_scores<-iris_comp2$scores
head(iris_comp2_scores)
## Comp.1 Comp.2 Comp.3 Comp.4
## [1,] -2.684207 -0.3266073 0.02151184 0.001006157
## [2,] -2.715391 0.1695568 0.20352143 0.099602424
## [3,] -2.889820 0.1373456 -0.02470924 0.019304543
## [4,] -2.746437 0.3111243 -0.03767198 -0.075955274
## [5,] -2.728593 -0.3339246 -0.09622970 -0.063128733
## [6,] -2.279897 -0.7477827 -0.17432562 -0.027146804
A Factor Analysis approaches data reduction in a fundamentally different way. It is a model of the measurement of a latent variable. This latent variable cannot be directly measured with a single variable (think: intelligence, social anxiety, soil health). Instead, it is seen through the relationships it causes in a set of Y variables.
For example, we may not be able to directly measure social anxiety. But we can measure whether social anxiety is high or low with a set of variables like “I am uncomfortable in large groups” and “I get nervous talking with strangers.” People with high social anxiety will give similar high responses to these variables because of their high social anxiety. Likewise, people with low social anxiety will give similar low responses to these variables because of their low social anxiety.
Steps in Factor Analysis
A) Test assumptions of Factor Analysis such as Factorability and Sample size
B) Exploratory Factor Analysis (EFA)
C) Determine the number of Factors
D) Identify which items belong to which factor
E) Drop items as necessary and repeat steps C and D
F) Name and define factors.
G) Examine correlations among factors
H) Analyze internal reliability
Load required libraries
#install.packages("corpcor",repos = "http://cran.us.r-project.org")
#install.packages("GPArotation",repos = "http://cran.us.r-project.org")
#install.packages("psych",repos = "http://cran.us.r-project.org")
#install.packages("ggplot2",repos = "http://cran.us.r-project.org")
#install.packages("MASS",repos = "http://cran.us.r-project.org")
#install.packages("MVN",repos = "http://cran.us.r-project.org")
#install.packages("psy",repos = "http://cran.us.r-project.org")
#install.packages("REdaS",repos = "http://cran.us.r-project.org")
#install.packages("grid",repos = "http://cran.us.r-project.org")
library(corpcor)
library(GPArotation)
library(psych)
library(ggplot2)
library(MASS)
library(MVN)
## Warning: package 'MVN' was built under R version 3.4.3
## sROC 0.1-2 loaded
##
## Attaching package: 'MVN'
## The following object is masked from 'package:psych':
##
## mardia
library(psy)
##
## Attaching package: 'psy'
## The following object is masked from 'package:psych':
##
## wkappa
library(grid)
library(REdaS)
## Warning: package 'REdaS' was built under R version 3.4.3
Load data
The pain relief data has 6 variables, and we are required to do a factor analysis on these 6 variables to identify whether there are highly correlated variables that can be grouped into a single component or factor.
All the variables are of type numeric and are continuous in nature.
The following variable names are used in our analysis in R:
* does.not.upset.stomach
* no.bad.side.effects
* stops.the.pain
* works.quickly
* keeps.me.awake
* provides.limited.relief
painReliefData <- read.csv("D:/Analytics/BACP-Dec2017/08_PCA_and_FA/Pain_Relief.csv")
str(painReliefData)
## 'data.frame': 100 obs. of 6 variables:
## $ does.not.upset.stomach: num -2.6075 -2.5977 -0.6027 0.0515 -2.5975 ...
## $ no.bad.side.effects : num -2.0409 0.5938 0.0568 -1.4052 0.3321 ...
## $ stops.the.pain : num 0.668 -0.69 -2.635 -1.335 -0.825 ...
## $ works.quickly : num 0.405 -0.814 -3.839 -1.148 -1.125 ...
## $ keeps.me.awake : num 1.3211 -0.5492 -0.5083 -0.0914 2.1677 ...
## $ limited.relief : num -0.351 2.948 1.526 1.522 0.278 ...
The assumption is that there are at least some correlations amongst the variables so that coherent factors can be identified. Basically, there should be some degree of collinearity among the variables, but not an extreme degree (singularity).
Factorability can be examined via any of the following:
a. Inter-item correlations (correlation matrix) - are there at least several sizable correlations, e.g. > 0.5?
b. Anti-image correlation matrix diagonals - they should be > 0.5
c. Measure of Sampling Adequacy (MSA) using the KMO test
d. Bartlett's test of sphericity (should be significant)
A. Correlation Matrix
To do the factor analysis we must have variables that correlate fairly well with each other. The correlation matrix is generated in R to check the pattern of relationships between variables.
PainReliefMatrix <- cor(painReliefData)
round(PainReliefMatrix,2)
## does.not.upset.stomach no.bad.side.effects
## does.not.upset.stomach 1.00 0.60
## no.bad.side.effects 0.60 1.00
## stops.the.pain 0.16 0.13
## works.quickly 0.09 -0.07
## keeps.me.awake -0.59 -0.64
## limited.relief -0.17 -0.06
## stops.the.pain works.quickly keeps.me.awake
## does.not.upset.stomach 0.16 0.09 -0.59
## no.bad.side.effects 0.13 -0.07 -0.64
## stops.the.pain 1.00 0.63 -0.03
## works.quickly 0.63 1.00 0.06
## keeps.me.awake -0.03 0.06 1.00
## limited.relief -0.63 -0.61 0.03
## limited.relief
## does.not.upset.stomach -0.17
## no.bad.side.effects -0.06
## stops.the.pain -0.63
## works.quickly -0.61
## keeps.me.awake 0.03
## limited.relief 1.00
From the above correlation matrix we can see that every variable has a correlation coefficient greater than 0.5 in absolute value with at least one other variable. Hence we can assume that the variables are fairly correlated with each other, and we can run Factor Analysis on this data.
B. Anti-image correlation matrix diagonals
X <- cor(painReliefData)
iX <- ginv(X)
S2 <- diag(diag((iX^-1)))
AIS <- S2%*%iX%*%S2 # anti-image covariance matrix
IS <- X+AIS-2*S2 # image covariance matrix
Dai <- sqrt(diag(diag(AIS)))
IR <- ginv(Dai)%*%IS%*%ginv(Dai) # image correlation matrix
AIR <- ginv(Dai)%*%AIS%*%ginv(Dai) # anti-image correlation matrix
print(diag(AIR), row.names = FALSE)
## [1] 1 1 1 1 1 1
Anti-image correlation matrix diagonals should be > 0.5.
Since the diagonals of the anti-image correlation matrix here are all 1, we can run Factor Analysis on this data.
C. Measure of Sampling Adequacy (MSA) using the KMO Test
The Kaiser-Meyer-Olkin (KMO) statistic should be > 0.5
PainReliefMatrix <- cor(painReliefData)
KMO(PainReliefMatrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = PainReliefMatrix)
## Overall MSA = 0.71
## MSA for each item =
## does.not.upset.stomach no.bad.side.effects stops.the.pain
## 0.75 0.67 0.70
## works.quickly keeps.me.awake limited.relief
## 0.70 0.70 0.75
The Kaiser-Meyer-Olkin (KMO) test measures how suited the data are for factor analysis. The statistic estimates the proportion of total variance among the variables that might be common variance.
KMO returns values between 0 and 1:
* Values between 0.8 and 1 indicate that the sampling is adequate.
* Values between 0.5 and 0.8 indicate that the sampling is only marginally adequate and the data may need attention before conducting EFA.
* Values below 0.5 indicate large partial correlations, which are a problem for factor analysis.
In words of Kaiser:
* 0.00 to 0.49 unacceptable
* 0.50 to 0.59 miserable
* 0.60 to 0.69 mediocre
* 0.70 to 0.79 middling
* 0.80 to 0.89 meritorious
* 0.90 to 1.00 marvelous
In this case:
* The overall MSA is 0.71, which corresponds to a "middling" degree of common variance.
* The MSA estimates for the individual items are 0.75, 0.67, 0.70, 0.70, 0.70 and 0.75.
Since the MSA is > 0.5, we can run Factor Analysis on this data.
D. Bartlett’s Test of Sphericity
PainReliefMatrix <- cor(painReliefData)
cortest.bartlett(PainReliefMatrix)
## Warning in cortest.bartlett(PainReliefMatrix): n not specified, 100 used
## $chisq
## [1] 226.1
##
## $p.value
## [1] 1.006033e-39
##
## $df
## [1] 15
Bartlett's test was conducted in R and was found to be significant (p < 0.001). The significance of this test tells us that the correlation matrix is not an identity matrix; hence we assume there is some relationship between the variables.
The sample size should be large enough to yield reliable estimates of correlations among the variables:
* Ideally, there should be a large ratio of N / k (Cases / items)
* EFA can still be reasonably done with > ~ 5:1
* Bare min. for pilot study purposes as low as ~3:1
In this case we have N = 100 and k = 6 and the ratio is 100 : 6 = ~ 17 : 1.
Hence the sample size is large enough to yield reliable estimates of the correlations among the variables.
Exploratory Factor Analysis (EFA) is generally used to discover the structure of a measure and to examine its internal reliability. EFA is often recommended when researchers have no hypotheses about the nature of the underlying factor structure of their measure.
Exploratory factor analysis has three basic decision points:
(1) Decide the number of factors
(2) Choosing an extraction method
(3) Choosing a rotation method
Decide the number of factors
The most common approach to deciding the number of factors is to generate a scree plot (graph with factors on the x-axis and eigenvalues on the y-axis).
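A widely used complement to the scree plot is parallel analysis, which compares the observed eigenvalues with those from random data; the psych package offers this as fa.parallel(). A minimal sketch on the pain relief data:
# Illustration only: parallel analysis suggests how many factors / components to retain
library(psych)
fa.parallel(painReliefData, fm = "pa", fa = "both")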
Choosing an extraction method and extraction
Once the number of factors is decided, you need to choose an extraction method to estimate the loadings. There are five basic extraction methods:
1. PCA - which assumes there is no measurement error and is considered not to be true exploratory factor analysis.
2. Maximum Likelihood (canonical factoring)
3. Alpha Factoring
4. Image Factoring
5. Principal axis factoring with iterated communalities (least squares)
This can be done in a number of different ways; the two most common methods are:
* Principal Component Analysis (PCA)
* Principal Axis Factoring
Principal axis factoring tries to find the lowest number of factors that can account for the common variability in the original variables (in contrast to the principal components method, which looks for a set of components that account for the total variability in the original variables). The two methods tend to give similar results if the variables are quite highly correlated and/or the number of original variables is quite high. Whichever method is used, the resulting factors at this stage will be uncorrelated.
PCA
Let us perform an initial PCA and understand the eigen values and the variance explained by them.
pc1 <- principal(painReliefData,nfactors = length(painReliefData), rotate = "none")
pc1
## Principal Components Analysis
## Call: principal(r = painReliefData, nfactors = length(painReliefData),
## rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 PC3 PC4 PC5 PC6 h2 u2 com
## does.not.upset.stomach 0.67 -0.52 -0.39 0.30 0.18 -0.09 1 0.0e+00 3.2
## no.bad.side.effects 0.59 -0.64 0.36 0.06 0.08 0.31 1 7.8e-16 3.1
## stops.the.pain 0.71 0.51 0.31 0.00 0.23 -0.30 1 -2.2e-16 3.0
## works.quickly 0.60 0.63 -0.25 -0.30 0.17 0.25 1 -1.1e-15 3.3
## keeps.me.awake -0.55 0.68 0.05 0.39 0.23 0.18 1 1.1e-16 3.1
## limited.relief -0.68 -0.52 0.00 -0.22 0.45 -0.05 1 -6.7e-16 2.9
##
## PC1 PC2 PC3 PC4 PC5 PC6
## SS loadings 2.43 2.07 0.44 0.39 0.38 0.29
## Proportion Var 0.41 0.34 0.07 0.06 0.06 0.05
## Cumulative Var 0.41 0.75 0.82 0.89 0.95 1.00
## Proportion Explained 0.41 0.34 0.07 0.06 0.06 0.05
## Cumulative Proportion 0.41 0.75 0.82 0.89 0.95 1.00
##
## Mean item complexity = 3.1
## Test of the hypothesis that 6 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0
## with the empirical chi square 0 with prob < NA
##
## Fit based upon off diagonal values = 1
cat("Eigen values\n")
## Eigen values
print(pc1$values)
## [1] 2.4307444 2.0698505 0.4391380 0.3869706 0.3803800 0.2929165
Scree Plot
plot(pc1$values, type = "b",xlab = "Factors", ylab = "Eigen values", main = "SCREE PLOT")
From the scree plot, we notice a steep drop before factor 3, after which the line flattens. We retain the components or factors on the steep part of the curve, before the first point at which the line flattens. We notice that these 2 factors explain most of the variation - 75%.
So we shall use 2 as the number of factors when performing the Factor Analysis.
Principal axis factoring
We shall use principal axis factoring (fm = "pa") because we are most interested in identifying the underlying constructs in the data.
The extraction method will produce factor loadings for every item in every extracted factor.
Now we will use the fa() function from the psych package, which takes the following primary arguments:
* r: the correlation matrix
* nfactors: number of factors to be extracted (default 1)
* rotate: one of several matrix rotation methods, such as "varimax", "oblimin" or "none"
* fm: one of several factoring methods, such as pa (principal axis) or ml (maximum likelihood)
PainReliefMatrix <- cor(painReliefData)
solution <- fa(r=PainReliefMatrix, nfactors = 2, rotate = "none", fm = "pa")
print(solution)
## Factor Analysis using method = pa
## Call: fa(r = PainReliefMatrix, nfactors = 2, rotate = "none", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 h2 u2 com
## does.not.upset.stomach 0.60 -0.46 0.57 0.43 1.9
## no.bad.side.effects 0.55 -0.60 0.65 0.35 2.0
## stops.the.pain 0.67 0.46 0.65 0.35 1.8
## works.quickly 0.56 0.57 0.63 0.37 2.0
## keeps.me.awake -0.50 0.62 0.64 0.36 1.9
## limited.relief -0.63 -0.46 0.61 0.39 1.8
##
## PA1 PA2
## SS loadings 2.06 1.70
## Proportion Var 0.34 0.28
## Cumulative Var 0.34 0.63
## Proportion Explained 0.55 0.45
## Cumulative Proportion 0.55 1.00
##
## Mean item complexity = 1.9
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 15 and the objective function was 2.35
## The degrees of freedom for the model are 4 and the objective function was 0.03
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.03
##
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.92 0.91
## Multiple R square of scores with factors 0.85 0.82
## Minimum correlation of possible factor scores 0.69 0.64
print(solution$loadings)
##
## Loadings:
## PA1 PA2
## does.not.upset.stomach 0.596 -0.462
## no.bad.side.effects 0.546 -0.596
## stops.the.pain 0.665 0.457
## works.quickly 0.559 0.568
## keeps.me.awake -0.499 0.623
## limited.relief -0.631 -0.459
##
## PA1 PA2
## SS loadings 2.055 1.699
## Proportion Var 0.343 0.283
## Cumulative Var 0.343 0.626
fa.diagram(solution)
We observe that the component loadings are not clear. Once an initial solution is obtained, the loadings are rotated. Factor rotation is used to increase interpretability.
Rotation is a way of maximizing high loadings and minimizing low loadings so that the simplest possible structure is achieved.
There are two types of rotation method, orthogonal and oblique rotation.
In orthogonal rotation the rotated factors will remain uncorrelated whereas in oblique rotation the resulting factors will be correlated.
There are a number of different methods of rotation of each type.
The most common orthogonal method is called varimax rotation.
Assuming that there is no correlation between the extracted factors, we will carry out a varimax rotation.
#Orthogonal varimax rotation
PainReliefMatrix <- cor(painReliefData)
solution1 <- fa(r=PainReliefMatrix, nfactors = 2, rotate = "varimax", fm = "pa")
print(solution1)
## Factor Analysis using method = pa
## Call: fa(r = PainReliefMatrix, nfactors = 2, rotate = "varimax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 h2 u2 com
## does.not.upset.stomach 0.14 0.74 0.57 0.43 1.1
## no.bad.side.effects 0.01 0.81 0.65 0.35 1.0
## stops.the.pain 0.80 0.10 0.65 0.35 1.0
## works.quickly 0.79 -0.05 0.63 0.37 1.0
## keeps.me.awake 0.04 -0.80 0.64 0.36 1.0
## limited.relief -0.78 -0.08 0.61 0.39 1.0
##
## PA1 PA2
## SS loadings 1.90 1.86
## Proportion Var 0.32 0.31
## Cumulative Var 0.32 0.63
## Proportion Explained 0.51 0.49
## Cumulative Proportion 0.51 1.00
##
## Mean item complexity = 1
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 15 and the objective function was 2.35
## The degrees of freedom for the model are 4 and the objective function was 0.03
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.03
##
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.91 0.91
## Multiple R square of scores with factors 0.84 0.83
## Minimum correlation of possible factor scores 0.67 0.66
print(solution1$loadings)
##
## Loadings:
## PA1 PA2
## does.not.upset.stomach 0.137 0.742
## no.bad.side.effects 0.808
## stops.the.pain 0.801 0.102
## works.quickly 0.795
## keeps.me.awake -0.797
## limited.relief -0.777
##
## PA1 PA2
## SS loadings 1.897 1.857
## Proportion Var 0.316 0.309
## Cumulative Var 0.316 0.626
From the above output we can see that on applying varimax rotation, the component loadings are very clear.
fa.diagram(solution1)
The square boxes are the observed variables, and the ovals are the unobserved factors. The straight arrows are the loadings, the correlation between the factor and the observed variable(s).
Identify which items belong to which factor
Factor loadings can be classified by their magnitude:
* 0.50 or more - Practically significant
* 0.40 to 0.49 - More important
* 0.30 to 0.39 - Minimum consideration level
Items belonging to factor - PA1:
1. Stops the pain
2. Works quickly
3. Provides Limited Relief
Items belonging to factor - PA2:
1. Does not upset stomach
2. No bad side effects
3. Keeps me awake
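To see these assignments at a glance, small loadings can be suppressed when printing the rotated solution; a minimal sketch:
# Hide loadings below 0.4 so the item-to-factor assignment stands out
print(solution1$loadings, cutoff = 0.4)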
Drop items as necessary and repeat the rotation and identification steps
Name and define factors
The variables that load highly on factor - PA1:
1. Stops the pain
2. Works quickly
3. Provides Limited Relief
All these items are related to the action of pain reliever, so we can label this factor as Pain reliever Action.
The variables that load highly on factor - PA2:
1. Does not upset stomach
2. No bad side effects
3. Keeps me awake
All these items are related to the side effects of pain reliever, so we can label this factor as Pain reliever side effects.
Oblique Rotation (oblimin): We will use oblique rotation, which recognizes that there is likely to be some correlation between pain relief factors in the real world.
#Oblique rotation (oblimin)
PainReliefMatrix <- cor(painReliefData)
solution2 <- fa(r=PainReliefMatrix, nfactors = 2, rotate = "oblimin", fm = "pa")
print(solution2)
## Factor Analysis using method = pa
## Call: fa(r = PainReliefMatrix, nfactors = 2, rotate = "oblimin", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 h2 u2 com
## does.not.upset.stomach 0.11 0.74 0.57 0.43 1
## no.bad.side.effects -0.02 0.81 0.65 0.35 1
## stops.the.pain 0.80 0.06 0.65 0.35 1
## works.quickly 0.80 -0.09 0.63 0.37 1
## keeps.me.awake 0.07 -0.80 0.64 0.36 1
## limited.relief -0.78 -0.04 0.61 0.39 1
##
## PA1 PA2
## SS loadings 1.90 1.85
## Proportion Var 0.32 0.31
## Cumulative Var 0.32 0.63
## Proportion Explained 0.51 0.49
## Cumulative Proportion 0.51 1.00
##
## With factor correlations of
## PA1 PA2
## PA1 1.00 0.08
## PA2 0.08 1.00
##
## Mean item complexity = 1
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 15 and the objective function was 2.35
## The degrees of freedom for the model are 4 and the objective function was 0.03
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is 0.03
##
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.92 0.91
## Multiple R square of scores with factors 0.84 0.83
## Minimum correlation of possible factor scores 0.68 0.66
print(solution2$loadings)
##
## Loadings:
## PA1 PA2
## does.not.upset.stomach 0.114 0.736
## no.bad.side.effects 0.809
## stops.the.pain 0.800
## works.quickly 0.799
## keeps.me.awake -0.801
## limited.relief -0.777
##
## PA1 PA2
## SS loadings 1.899 1.852
## Proportion Var 0.317 0.309
## Cumulative Var 0.317 0.625
From the above output we can conclude that the oblimin rotation produces essentially the same factor loadings as the varimax rotation.
Examine correlations among factors
We notice that the factors are only weakly correlated, at about 0.08. Recall that our choice of oblique rotation allowed the factors to correlate; since they barely do, the assumption of correlated factors that motivated the oblique rotation is not supported.
Analyze internal reliability
Cronbach's alpha is a measure of internal consistency, that is, how closely related a set of items are as a group.
Cronbach's alpha is computed by correlating the score for each scale item with the total score for each observation (usually individual survey respondents or test takers), and then comparing that to the variance of all individual item scores. Cronbach's alpha is a function of the following (a small computational sketch follows the list):
* number of items in a test
* average covariance between pairs of items
* variance of the total score
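Putting these ingredients together, a minimal sketch of the computation directly from the definition, which should agree with the cronbach() output below:
# Cronbach's alpha from its definition:
# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
k <- ncol(painReliefData)
item_var <- apply(painReliefData, 2, var)
total_var <- var(rowSums(painReliefData))
(k / (k - 1)) * (1 - sum(item_var) / total_var)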
cronbach(painReliefData)
## $sample.size
## [1] 100
##
## $number.of.items
## [1] 6
##
## $alpha
## [1] -0.6916936
The alpha coefficient of reliability ranges from 0 to 1 in providing this overall assessment of a measure's reliability.
If all of the scale items are entirely independent of one another (i.e., are not correlated and share no covariance), then alpha = 0.
If all of the items have high covariances, then alpha approaches 1 as the number of items in the scale increases.
A good alpha coefficient depends on your theoretical knowledge of the scale in question. Many methodologists recommend a minimum alpha coefficient between 0.65 and 0.80; alpha coefficients less than 0.50 are usually unacceptable.
The alpha here is negative because some items (keeps.me.awake and limited.relief) are keyed in the opposite direction from the others and would normally be reverse-scored before computing alpha. Taking that into account, the magnitude of the coefficient is above 0.65, so the overall reliability of the measure is judged to be good.
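If the negatively keyed items are reversed before computing alpha, the coefficient should come out positive; psych::alpha() can do this reversal automatically, as in this minimal sketch (psych::alpha is not used elsewhere in this note).
# Illustration only: check.keys = TRUE reverses negatively keyed items first
psych::alpha(painReliefData, check.keys = TRUE)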
Orthogonal Rotation (varimax)
fa.diagram(solution1)
Oblique Rotation (oblimin)
fa.diagram(solution2)
Observation
* The graphs also confirm our interpretation that factor 1 loads heavily on
+ stops the pain
+ works quickly
+ provides limited relief