What is Factor Analysis?
See: https://www.youtube.com/watch?v=WV_jcaDBZ2I
The purpose of factor analysis is to estimate a model that explains the variance/covariance between a set of observed variables (in a population) by a smaller set of unobserved factors plus weightings.
To better understand this, suppose you have data on people admitted to psychiatric care (not the population, just a sample, within which there is a degree of variance/covariance between this set of variables) about whether or not they experience symptoms such as insomnia and suicidal thoughts.
Suppose there is a covariance between Insomnia and Suicidal thoughts of about 0.3. The idea is to come up with a model that explains that covariance in the population.
The way factor analysis works is that we suppose the variance and covariance structure in our observed characteristics is, at least in part, due to some unobserved FACTORS; in our case these might be Depression and Anxiety.
Perhaps these two underlying FACTORS are responsible for the variance/covariance between these VARIABLES.
The idea is that these two FACTORS actually cause the variance/covariance between these observed VARIABLES.
There is a WEIGHTING of DEPRESSION on each of the observed VARIABLES, as well as a WEIGHTING of ANXIETY on each of the observed VARIABLES, to the extent that these two particular unobserved FACTORS actually have a causal effect on each of these observed VARIABLES.
Typically the WEIGHTING that each FACTOR (unobserved characteristic) has on each VARIABLE (observed characteristic) is different.
So w11 might be the weight of Depression on Insomnia,
and w12 might be the weight of Anxiety on Insomnia,
and so on and so forth.
So there is a set of weights and a set of unobserved factors, and what we are trying to do is estimate these weights and these unobserved factors.
Note: these unobserved factors can themselves be correlated. Not only that, those unobserved factors can also be caused, further down the chain, by other unobserved factors.
Regarding Each Variable
We can think about the variance of one variable, Insomnia for example. We are trying to explain the variance in insomnia in the population, and to estimate it using our sample of data.
What we do is suppose that there is a proportion of the variance in insomnia which is due to these shared unobserved factors. We call this the COMMUNALITY, because it is the proportion of variance explained by a set of factors which are common to the other observed variables.
But we also suppose that there is a proportion of the variance in insomnia which is not explained by these unobserved factors. We call this the unique variance of that particular observed variable, in the sense that it is unique to that specific observed variable and is not caused by the common set of factors.
Typically we also suppose that there is a set of unobserved variables (e1, e2, e3, e4) which themselves explain the unique variance of each particular variable. And if these terms e1, e2, e3, e4 are themselves correlated, then we can also think about there being a proportion of covariance which is due to the shared factors and a proportion of covariance which is due to these unique variables.
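As a rough sketch of this setup, here is a hedged toy simulation in R (the factor names, weights and noise level are made up purely for illustration): two factors generate one observed variable through weights plus a unique term, and the variance splits into communality plus unique variance.
set.seed(1)
n <- 1000
Depression <- rnorm(n)        # unobserved factor 1
Anxiety    <- rnorm(n)        # unobserved factor 2
w11 <- 0.7; w12 <- 0.3        # illustrative weights of the factors on Insomnia
e1  <- rnorm(n, sd = 0.5)     # unique part of Insomnia, not explained by the factors
Insomnia <- w11 * Depression + w12 * Anxiety + e1
var(Insomnia)                 # roughly w11^2 + w12^2 (communality) + 0.5^2 (unique variance)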
What are the uses of Factor Analysis
The current default method for evaluating and validating a psychometric instrument involves performing a factor analysis of its items to determine whether they present latent dimensions that are in agreement with the theorized constructs.
x = read.csv("factoranalysis.csv") # load the questionnaire data
x = x[,-1] # drop the first column so only the questionnaire items (Q17A to Q17G) remain
You can check which variables have high correlations with one another.
summary(x)
## Q17A Q17B Q17C Q17D
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :3.000 Median :3.000 Median :4.000 Median :4.000
## Mean :3.434 Mean :3.505 Mean :3.601 Mean :3.984
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Q17E Q17F Q17G
## Min. :1.000 Min. :1.0 Min. :1.000
## 1st Qu.:3.000 1st Qu.:4.0 1st Qu.:4.000
## Median :3.000 Median :4.0 Median :4.000
## Mean :3.354 Mean :4.1 Mean :4.193
## 3rd Qu.:4.000 3rd Qu.:5.0 3rd Qu.:5.000
## Max. :5.000 Max. :5.0 Max. :5.000
cor(x) # Correlation Matrix
## Q17A Q17B Q17C Q17D Q17E Q17F Q17G
## Q17A 1.0000000 0.4370851 0.6557886 0.4169181 0.6395923 0.3588108 0.3766165
## Q17B 0.4370851 1.0000000 0.5830892 0.4283782 0.4461499 0.2996508 0.1927364
## Q17C 0.6557886 0.5830892 1.0000000 0.4607460 0.6067006 0.3795506 0.3842428
## Q17D 0.4169181 0.4283782 0.4607460 1.0000000 0.4442956 0.3690653 0.3288365
## Q17E 0.6395923 0.4461499 0.6067006 0.4442956 1.0000000 0.3547855 0.2932414
## Q17F 0.3588108 0.2996508 0.3795506 0.3690653 0.3547855 1.0000000 0.6182730
## Q17G 0.3766165 0.1927364 0.3842428 0.3288365 0.2932414 0.6182730 1.0000000
KMO should be the first test applied to the variables you want to study.
If its value is lower than 0.6, you should inspect the variables to decide which ones should not take part in the factor analysis. To do so, drop the indicator variables with the lowest individual KMO statistic values until the overall KMO rises above .60.
In statistics, what does the KMO statistic mean?
Measured by the Kaiser-Meyer-Olkin (KMO) statistic, sampling adequacy predicts whether data are likely to factor well, based on correlations and partial correlations.
In the old days of manual factor analysis, this was extremely useful.
KMO can still be used, however, to assess which variables to drop from the model because they are too multicollinear.
There is a KMO statistic for each individual variable, and an overall KMO statistic computed from all of them together. KMO varies from 0 to 1.0, and the overall KMO should be .60 or higher to proceed with factor analysis.
If it is not, drop the indicator variables with the lowest individual KMO statistic values until the overall KMO rises above .60.
The concept is that the partial correlations should not be very large if one is to expect distinct factors to emerge from factor analysis.
QUEST <- data.frame(
Q1=c(1,5,2,3,4,2,3,4,3,2),
Q2=c(2,4,1,2,4,1,2,5,2,1),
Q3=c(2,5,1,3,3,2,2,4,2,2))
#install.packages("psych")
library(psych)
KMO(QUEST)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = QUEST)
## Overall MSA = 0.77
## MSA for each item =
## Q1 Q2 Q3
## 0.78 0.79 0.74
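As a hedged sketch, the same check can be run on the questionnaire data x loaded above; KMO() returns both the overall MSA and the per-item values, which is what you would inspect when deciding which variable to drop first.
library(psych)
km <- KMO(x)       # overall and per-item measures of sampling adequacy
km$MSA             # overall KMO
sort(km$MSAi)[1]   # the item with the lowest individual KMO (first candidate to drop)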
Source: http://www.instantr.com/2012/12/12/performing-bartletts-test-in-r/
Bartlett's test
Bartlett's test allows you to compare the variance of two or more samples to determine whether they are drawn from populations with equal variance.
It is suitable for normally distributed data. The test has the null hypothesis that the variances are equal and the alternative hypothesis that they are not.
The test can be done only on numeric values (no strings).
This test is useful for checking the assumptions of an analysis of variance.
H0: Samples have equal variance.
H1: At least one sample has a significantly different variance.
p-value <= 0.05 reject the null hypothesis.
p-value > 0.05 fail to reject the null hypothesis.
Stacked variables
You can perform Bartlett’s test with the bartlett.test function. If your data is in stacked form (with the values for both samples stored in one variable), use the command:
bartlett.test(values~groups, dataset)
where values is the name of the variable containing the data values and groups is the name of the variable that specifies which sample each value belongs to.
Non Stacked variables
If your data is in unstacked form (with the samples stored in separate variables) nest the variable names inside the list function as shown below.
bartlett.test(list(dataset$sample1, dataset$sample2, dataset$sample3))
If you are unsure whether your data is in stacked or unstacked form, see the article Stacking a dataset in R for examples of data in both forms.
Consider the PlantGrowth dataset (included with R), which gives the dried weight of three groups of ten batches of plants, where each group of ten batches received a different treatment. The weight variable gives the weight of the batch and the group variable gives the treatment received (either ctrl, trt1 or trt2). To view more information about the dataset, enter help(PlantGrowth). To view the data, enter the dataset name:
head(PlantGrowth)
## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl
## 5 4.50 ctrl
## 6 4.61 ctrl
Suppose you want to use Bartlett's test to determine whether the variance in weight is the same for all treatment groups. A significance level of 0.05 will be used.
To perform the test, use the following command, which gives this output:
bartlett.test(weight~group, PlantGrowth)
##
## Bartlett test of homogeneity of variances
##
## data: weight by group
## Bartlett's K-squared = 2.8786, df = 2, p-value = 0.2371
From the output we can see that the p-value of 0.2371 is not less than the significance level of 0.05.
This means we cannot reject the null hypothesis that the variance is the same for all treatment groups.
This means that there is no evidence to suggest that the variance in plant growth is different for the three treatment groups.
Quote from Tadeu: The Bartlett test indicates whether the correlation matrix is adequate for a factor analysis, and the p-value should be different from zero (0).
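The variant behind that quote is Bartlett's test of sphericity, which checks whether the correlation matrix differs from an identity matrix (i.e., whether the variables are correlated enough to factor). A minimal sketch with the psych package, assuming the questionnaire data x loaded above:
library(psych)
cortest.bartlett(cor(x), n = nrow(x))  # a very small p-value indicates the correlation matrix is suitable for factoring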
head(PlantGrowth)
## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl
## 5 4.50 ctrl
## 6 4.61 ctrl
table(PlantGrowth$group)
##
## ctrl trt1 trt2
## 10 10 10
datasetA = subset(PlantGrowth, group=="ctrl") # control group
datasetB = subset(PlantGrowth, group=="trt1") # treatment 1
datasetC = subset(PlantGrowth, group=="trt2") # treatment 2
dataset = data.frame(datasetA[,1],datasetB[,1],datasetC[,1]) # one column of weights per group (unstacked form)
names(dataset) = c("A","B","C")
dataset
## A B C
## 1 4.17 4.81 6.31
## 2 5.58 4.17 5.12
## 3 5.18 4.41 5.54
## 4 6.11 3.59 5.50
## 5 4.50 5.87 5.37
## 6 4.61 3.83 5.29
## 7 5.17 6.03 4.92
## 8 4.53 4.89 6.15
## 9 5.33 4.32 5.80
## 10 5.14 4.69 5.26
bartlett.test(dataset)
##
## Bartlett test of homogeneity of variances
##
## data: dataset
## Bartlett's K-squared = 2.8786, df = 2, p-value = 0.2371
Choosing the number of factors
The number of principal components equals the number of variables.
Proportion of Variance: the proportion of the total variance explained by that component. E.g., if it equals .85 then that component explains 85% of the variance.
Cumulative Proportion: adds up the proportions of the preceding components until the total reaches 1.
The standard deviations reported are the square roots of the eigenvalues (eigenvalue = standard deviation squared).
The question here is: how many of these components (factors) do we want to retain?
Choose the number of components (factors) with eigenvalues equal to or higher than 1, using the principal component function.
In the example below it should be 2, because Comp.1 and Comp.2 are the only components with standard deviations of at least 1 (1.904 and 1.042, i.e., eigenvalues of about 3.62 and 1.09).
pca1 = princomp(x, scores=TRUE, cor=TRUE)
summary(pca1)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.9036937 1.0423367 0.81837919 0.75632747
## Proportion of Variance 0.5177214 0.1552094 0.09567779 0.08171875
## Cumulative Proportion 0.5177214 0.6729308 0.76860854 0.85032729
## Comp.5 Comp.6 Comp.7
## Standard deviation 0.64958592 0.56978592 0.54871770
## Proportion of Variance 0.06028027 0.04637943 0.04301302
## Cumulative Proportion 0.91060756 0.95698698 1.00000000
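A short sketch of the same Kaiser criterion computed directly from pca1 (the eigenvalues are the squared component standard deviations):
eig <- pca1$sdev^2   # eigenvalues of the correlation matrix
eig
sum(eig >= 1)        # number of components (factors) to retain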
Step 1 - Analysing Loadings
The loadings from the PCA above allow you to choose which variables go into which components.
To make this choice you should do the following.
Notice that R uses a cut-off of .1, so some loadings are shown blank.
These numbers represent the correlation between the component and the variable.
loadings(pca1)
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Q17A -0.416 0.174 -0.465 0.133 0.222 0.613 0.370
## Q17B -0.356 0.347 0.452 -0.631 0.378
## Q17C -0.436 0.212 -0.136 -0.203 0.351 -0.759
## Q17D -0.357 0.647 0.657 0.126
## Q17E -0.405 0.243 -0.345 0.264 -0.564 -0.512
## Q17F -0.341 -0.573 0.106 -0.194 -0.549 0.391 -0.229
## Q17G -0.320 -0.644 -0.110 0.437 -0.439 0.290
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143 0.143
## Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857 1.000
#pca1$loadings
Step 2 - Using rotation to confirm the loadings results
You can confirm the results of the PCA loadings by applying the factanal function to the same data.
It returns the same components, in this case called factors.
Each factor represents the variables' loadings.
You can run factanal() with the number of factors ranging from the maximum allowed downwards; the first number of factors at which the p-value drops below .05 is one too few, so the number just above it is a reasonable choice (see the sketch after this paragraph).
Here we do not talk about what percentage of the variation is explained;
we talk about the uniqueness of each variable.
As a result of this command you can also see the factor LOADINGS,
which are similar to the principal component loadings (a slightly different calculation).
Similarly to the PCA loadings, you should put each variable into the factor on which it has the highest loading.
The rotated loadings are more reliable than the raw PCA loadings, so if they disagree you should stick to the rotation results.
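A minimal sketch of that search, assuming the questionnaire data x from above (with 7 variables, factanal() allows at most 3 factors):
for (k in 1:3) {
  fit <- factanal(x, factors = k, rotation = "varimax")
  cat("factors =", k, " p-value =", signif(fit$PVAL, 3), "\n")
}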
# Using 2 factors according to Methods 1 and 2
fa1 = factanal(x, factors=2, rotation="varimax", scores="regression")
fa1
##
## Call:
## factanal(x = x, factors = 2, scores = "regression", rotation = "varimax")
##
## Uniquenesses:
## Q17A Q17B Q17C Q17D Q17E Q17F Q17G
## 0.393 0.575 0.305 0.660 0.425 0.565 0.005
##
## Loadings:
## Factor1 Factor2
## Q17A 0.730 0.272
## Q17B 0.645
## Q17C 0.789 0.270
## Q17D 0.524 0.254
## Q17E 0.735 0.187
## Q17F 0.315 0.579
## Q17G 0.149 0.986
##
## Factor1 Factor2
## SS loadings 2.506 1.565
## Proportion Var 0.358 0.224
## Cumulative Var 0.358 0.582
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 29.36 on 8 degrees of freedom.
## The p-value is 0.000274
The example above deals with 2 factors. However, in some cases you may deal with many factors, making it difficult to choose visually which variable goes into which factor.
In those cases you can store the absolute values of the loadings in a CSV file and find, for each variable, which absolute loading value is the highest amongst the factors, so you can decide which factor to assign that particular variable to.
To do so, first create the CSV file with the absolute values of the loadings.
write.csv(abs(fa1$loadings), "loadings.csv")
From now on you work in Excel.
First create a column called MAXIMUM. Store in this column the maximum of the values in each row. Remember that one row of the loadings holds all the loadings of a single variable, and you should keep the highest absolute value of them all.
EXCEL COMMAND: =MÁXIMO(...) (the Portuguese-locale name of the MAX function)
In doing so you can find which variable belongs to which factor without the burden of visually checking dozens of possible combinations.
For a smaller number of factors you can easily do this visually, but otherwise this is a good strategy (an R alternative is sketched below).
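As a hedged R alternative to the Excel step, the same per-variable maximum and factor assignment can be computed directly from the loadings matrix:
L <- abs(fa1$loadings[, 1:2])               # absolute loadings on the 2 retained factors
data.frame(max_loading = apply(L, 1, max),  # the "MAXIMUM" column built in Excel
           factor = apply(L, 1, which.max)) # factor with the highest absolute loading per variable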
This is an important result and deserves a good explanation, so let's go through it.
The pca1$scores output below returns a table of scores for each component.
These scores can be used to express, in a single value per observation, the combined contribution of all the variables to that particular component (factor).
Observe that the scores are calculated for all possible components (Comp.1 up to Comp.7).
The point here is that the values in Comp.1 refer to all observations from variables Q17A to Q17G put together.
Similarly, Comp.2 refers to all observations from variables Q17A to Q17G, with one important difference.
Comp.1 expresses more strongly the loadings of variables Q17A to Q17E, and Comp.2 expresses more strongly the loadings of variables Q17F to Q17G. The loadings of Q17F to Q17G on Comp.1 are not substantial, just as the loadings of Q17A to Q17E on Comp.2 are not substantial.
In other words, we can use Comp.1 and Comp.2 in our example as columns in the data set, each representing a single value that expresses the strength of Q17A to Q17E put together and of Q17F to Q17G put together.
We can work from now on comparing the means of these columns.
You can convert the scores to values ranging from 0 to 1 using the following principle: the minimum value maps to zero, the maximum value maps to 1 (100%), and the rest are scaled proportionally in between (see the sketch after the output below).
The scores thus generate new component variables that you can use instead of the original variables.
So you can look at a specific observation and see its score on one particular component.
I still can't quite see the use of it.
pca1$scores[1:10,]
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## [1,] -1.1264223 -1.01152651 -0.78296004 0.2864407 -0.26385391
## [2,] 0.4468826 -0.07196384 0.03190024 0.1700661 0.25391424
## [3,] 2.6800626 -1.93669278 0.63651709 -0.8806210 0.76301457
## [4,] -1.2513165 1.90617530 1.17988174 -0.2394527 0.73989863
## [5,] 1.6640059 0.35102689 -0.44206990 -0.3033111 -0.73510756
## [6,] 0.5063145 -1.16292192 -0.60171945 0.8285846 0.77298345
## [7,] -2.4132238 -0.21573872 -0.88050838 -0.5149585 0.28587756
## [8,] 1.7808478 0.06778094 1.43965756 1.7636303 -1.99360486
## [9,] 0.9105330 -0.29731034 0.17712330 0.3859131 -0.11955035
## [10,] -2.4235099 -0.71102632 -1.52424240 1.9423885 -0.05223305
## Comp.6 Comp.7
## [1,] 0.04499774 -0.25766222
## [2,] -0.02859984 -0.76673093
## [3,] 0.03417586 0.63706208
## [4,] -1.19119749 -0.38524923
## [5,] 0.46370033 -0.24504195
## [6,] -0.41914730 -0.88391369
## [7,] 0.48688043 -0.26260782
## [8,] -0.35092511 1.03224242
## [9,] 0.06118465 0.04058196
## [10,] 0.20377837 -1.06108664
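A minimal sketch of the 0-to-1 conversion described above, applied to the first two components (the minimum score maps to 0, the maximum to 1, and the rest are scaled in between):
rescale01 <- function(s) (s - min(s)) / (max(s) - min(s))
scores01 <- apply(pca1$scores[, 1:2], 2, rescale01)  # rescaled Comp.1 and Comp.2
head(scores01)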
What is Cronbach’s Alpha?
How does it measure the reliability of data?
How to interpret the output?
It is used to check whether the dataset can be used in a factor analysis; it measures whether a dataset is internally consistent.
It also helps identify, within a factor, which variables stay and which ones should be dropped.
It's better to think of Cronbach's alpha as measuring the "consistency" of your measure rather than its "reliability".
It takes your measure and splits it up to come out with a coefficient that tells you how consistent the items in your measure are at measuring a certain construct within the factor (component).
So, for instance, if you have a test that measures math, spelling, and science, it is going to have a low Cronbach's alpha because the test is measuring 3 different things.
But if the items on your test are all measuring the same thing (e.g., how to multiply) then you will have a high alpha.
When interpreting it you also have to take into account how long your test is: alpha tends to increase with the number of items, so a long test can show a high alpha even when its items are only loosely related.
So if you have an alpha of .85, then you can say that your test is 85% consistent/reliable in measuring the same construct.
# This is a straightforward function
QUEST <- data.frame(
Q1=c(1,5,2,3,4,2,3,4,3,2),
Q2=c(2,4,1,2,4,1,2,5,2,1),
Q3=c(2,5,1,3,3,2,2,4,2,2))
#install.packages("fmsb")
library(fmsb)
CronbachAlpha(QUEST)
## [1] 0.9295039
QUEST <- data.frame(
Q1=c(1,5,2,3,4,2,3,4,3,2),
Q2=c(2,4,1,2,4,1,2,5,2,1),
Q3=c(2,5,1,3,3,2,2,4,2,2))
#install.packages("psych")
library(psych)
alpha(QUEST)
##
## Reliability analysis
## Call: alpha(x = QUEST)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd
## 0.93 0.93 0.91 0.83 14 0.22 2.6 1.2
##
## lower alpha upper 95% confidence boundaries
## 0.5 0.93 1.36
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se
## Q1 0.90 0.91 0.83 0.83 10.1 0.37
## Q2 0.91 0.91 0.84 0.84 10.4 0.37
## Q3 0.88 0.89 0.80 0.80 8.2 0.38
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## Q1 10 0.93 0.94 0.89 0.86 2.9 1.2
## Q2 10 0.94 0.94 0.88 0.85 2.4 1.4
## Q3 10 0.94 0.95 0.91 0.88 2.6 1.2
##
## Non missing response frequency for each item
## 1 2 3 4 5 miss
## Q1 0.1 0.3 0.3 0.2 0.1 0
## Q2 0.3 0.4 0.0 0.2 0.1 0
## Q3 0.1 0.5 0.2 0.1 0.1 0
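As a hedged sketch tying this back to the factor analysis above, alpha can also be computed separately for the items grouped under each factor (Factor1 groups Q17A to Q17E and Factor2 groups Q17F and Q17G, according to the loadings shown earlier); the "Reliability if an item is dropped" table then shows which variables stay in a factor and which could be removed.
library(psych)
alpha(x[, c("Q17A", "Q17B", "Q17C", "Q17D", "Q17E")])  # items loading on Factor1
alpha(x[, c("Q17F", "Q17G")])                          # items loading on Factor2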
The scree plot shows the components on the x axis versus their eigenvalues (variances) on the y axis.
You can visually inspect how many components (factors) to use by choosing those with eigenvalues above 1.
plot(pca1)
screeplot(pca1, type="line", main="Scree Plot")
biplot(pca1) # plots the observations and the variable loadings together on the first two components
I still do not have a clear use for it here.
#varimax(pca1$loadings[, 1:2]) # varimax rotation of the retained component loadings (princomp stores loadings, not $rotation)
#promax(pca1$loadings[, 1:2])  # oblique promax rotation of the same loadings