Factor Analysis

20180519114700

What is Factor Analysis?

See: https://www.youtube.com/watch?v=WV_jcaDBZ2I

The purpose of factor analysis is to estimate a model which explains the variance/covariance between a set of observed variables (in a population) by a smaller set of unobserved factors plus weightings.

To better understand that, let's suppose you have data on people admitted to psychiatric care (not the population, just a sample, within which there is a degree of variance/covariance between this set of variables) about whether or not they experience:

  • Insomnia
  • Suicidal thoughts
  • Hyperventilation
  • Nausea

Let's suppose there is a covariance between Insomnia and Suicidal thoughts of about 0.3. The idea is to try to come up with a model that explains that covariance in the population.

The way in which factor analysis works is that we suppose that the variance and covariance structure in our observed characteristics is, at least in part, due to some unobserved FACTORS. In our example they might be:

  • Depression: Whether the individual is depressed or not
  • Anxiety: Whether the individual is experiencing some sort of extreme Anxiety

Perhaps these are the two underlying FACTORS which are responsible for the variance/covariance between these VARIABLES.

The idea is that these two FACTORS actually cause the variance/covariance between these observed VARIABLES.

There is a WEIGHTING of DEPRESSION on each of the observed VARIABLES, as well as a WEIGHTING of ANXIETY on each of the observed VARIABLES, to the extent that these two particular unobserved FACTORS actually have a causal effect on each of these observed VARIABLES.

Typically the WEIGHTINGS that the FACTORS (unobserved characteristics) have on the VARIABLES (observed characteristics) are different.

So w11 might be the weight of Depression on Insomnia,
and w12 might be the weight of Anxiety on Insomnia,
and so on and so forth.

So, there are a set of weights and a set of unobserved factors. And what we are trying to do is to estimate these weights and these unobserved factors.
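
As a rough sketch (my own notation, not taken from the video), the model for our example can be written as:

    \begin{aligned}
    \text{Insomnia}          &= w_{11}\,\text{Depression} + w_{12}\,\text{Anxiety} + e_1 \\
    \text{Suicidal thoughts} &= w_{21}\,\text{Depression} + w_{22}\,\text{Anxiety} + e_2 \\
    \text{Hyperventilation}  &= w_{31}\,\text{Depression} + w_{32}\,\text{Anxiety} + e_3 \\
    \text{Nausea}            &= w_{41}\,\text{Depression} + w_{42}\,\text{Anxiety} + e_4
    \end{aligned}

where the w's are the weights (loadings) and e1 to e4 are the unique error terms discussed below.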

Observation: These unobserved factors can themselves be correlated. Not only that, those unobserved factors can also be caused, further down the chain, by other unobserved factors.

Regarding Each Variable

We can think about the variance of one variable, Insomnia for example. We are trying to explain the variance of insomnia in the population, and we estimate it using our sample of data.

What we do is suppose that there is a proportion of the variance of insomnia which is due to these shared unobserved factors. We call this variance the COMMUNALITY, because it is the proportion of variance which is explained by a set of factors that are common to the other observed variables.

But we also suppose that there is a proportion of the variance of insomnia which is not explained by these unobserved factors. We call this the unique variance of that particular observed variable, in the sense that it is unique to that specific observed variable and is not caused by the common set of factors.

Typically we also suppose that there is a set of unobserved variables (e1, e2, e3, e4) which themselves explain the unique variance of each particular variable. And if these variables e1, e2, e3, e4 are themselves correlated, then we can also think about there being a proportion of covariance which is due to the shared factors and a proportion of covariance which is due to these unique variables.
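
In symbols (a sketch, assuming standardized and uncorrelated factors), the variance of Insomnia then splits as:

    \operatorname{Var}(\text{Insomnia}) = \underbrace{w_{11}^2 + w_{12}^2}_{\text{communality}} + \underbrace{\operatorname{Var}(e_1)}_{\text{unique variance}}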

What are the uses of Factor Analysis?

  1. Explaining the variance and covariance between a set of observed characteristics in terms of a (typically) simpler structure. (In our example above we have 4 observed characteristics and we try to explain the variance/covariance between these variables in terms of two unobserved factors, so we went from a system of dimensionality four to a system of dimensionality two);
  2. Testing a particular theory. (For example, there might be theoretical evidence which links depression and anxiety to each of these four characteristics, and we can test whether it is likely that our theory holds up when we compare it to the data);
  3. Dimensionality reduction: we started with a system of 4 variables and we try to estimate the variance/covariance between these variables in terms of a system of lower dimensionality. Starting off with a system of 4, we replaced it with a system of two unobserved factors, which can improve the predictive power of a model.

The current default method for evaluating and validating a psychometric instrument involves performing a Factor Analysis of its items to determine whether they present latent dimensions that are in agreement with theorized constructs.

20180519114701

Step by step Factor Analysis?

https://www.youtube.com/watch?v=Od8gfNOOS9o

Load Variables

        x = read.csv("factoranalysis.csv")  # read the data from the CSV file
        x = x[,-1]                          # drop the first column, which is not used in the analysis

Descriptive Statistics

You can check which variables have high correlations with each other.

    summary(x)
##       Q17A            Q17B            Q17C            Q17D      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000  
##  Median :3.000   Median :3.000   Median :4.000   Median :4.000  
##  Mean   :3.434   Mean   :3.505   Mean   :3.601   Mean   :3.984  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##       Q17E            Q17F          Q17G      
##  Min.   :1.000   Min.   :1.0   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:4.0   1st Qu.:4.000  
##  Median :3.000   Median :4.0   Median :4.000  
##  Mean   :3.354   Mean   :4.1   Mean   :4.193  
##  3rd Qu.:4.000   3rd Qu.:5.0   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.0   Max.   :5.000
    cor(x) # Correlation Matrix
##           Q17A      Q17B      Q17C      Q17D      Q17E      Q17F      Q17G
## Q17A 1.0000000 0.4370851 0.6557886 0.4169181 0.6395923 0.3588108 0.3766165
## Q17B 0.4370851 1.0000000 0.5830892 0.4283782 0.4461499 0.2996508 0.1927364
## Q17C 0.6557886 0.5830892 1.0000000 0.4607460 0.6067006 0.3795506 0.3842428
## Q17D 0.4169181 0.4283782 0.4607460 1.0000000 0.4442956 0.3690653 0.3288365
## Q17E 0.6395923 0.4461499 0.6067006 0.4442956 1.0000000 0.3547855 0.2932414
## Q17F 0.3588108 0.2996508 0.3795506 0.3690653 0.3547855 1.0000000 0.6182730
## Q17G 0.3766165 0.1927364 0.3842428 0.3288365 0.2932414 0.6182730 1.0000000

Kaiser-Meyer-Oklin Test (KMO)

KMO should be the first test applied to the variables you want to study.

If its value is lower than 0.6, you should inspect the variables to decide which ones should not take part in the factor analysis. To do so, drop the indicator variables with the lowest individual KMO statistic values until the overall KMO rises above 0.60.

In statistics, what does the KMO statistic mean?

Measured by the Kaiser-Meyer-Olkin (KMO) statistic, sampling adequacy predicts whether data are likely to factor well, based on correlations and partial correlations.

In the old days of manual factor analysis, this was extremely useful.

KMO can still be used, however, to assess which variables to drop from the model because they are too multicollinear.

There is a KMO statistic for each individual variable, and an overall KMO statistic computed from all of them together. KMO varies from 0 to 1.0, and the overall KMO should be .60 or higher to proceed with factor analysis.

If it is not, drop the indicator variables with the lowest individual KMO statistic values, until KMO overall rises above .60.

The concept is that the partial correlations should not be very large if one is to expect distinct factors to emerge from factor analysis.

    QUEST <- data.frame(
      Q1 = c(1,5,2,3,4,2,3,4,3,2),
      Q2 = c(2,4,1,2,4,1,2,5,2,1),
      Q3 = c(2,5,1,3,3,2,2,4,2,2))
    #install.packages("psych")
    library(psych)
    KMO(QUEST)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = QUEST)
## Overall MSA =  0.77
## MSA for each item = 
##   Q1   Q2   Q3 
## 0.78 0.79 0.74
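
The same check applied to the x data loaded earlier (a sketch, assuming the psych package is installed and x still holds the Q17A-Q17G columns):

    library(psych)
    KMO(x)        # overall MSA plus an individual MSA for each item
    KMO(cor(x))   # equivalent: KMO also accepts a correlation matrix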

Bartlett’s test (Homogeneity of Variances)

Source: http://www.instantr.com/2012/12/12/performing-bartletts-test-in-r/

    Bartlett's test  
    
    Bartlett's test allows you to compare the variance of two or more samples to determine whether they are drawn from populations with equal variance.  
    
    It is suitable for normally distributed data. The test has the null hypothesis that the variances are equal and the alternative hypothesis that they are not equal.  
    
    The test can be done only on numeric values (no strings).  
    
    This test is useful for checking the assumptions of an analysis of variance.  
    
    H0: Samples have equal variance.  
    H1: At least one sample has a significantly different variance.
    
    p-value <= 0.05 reject the null hypothesis.  
    p-value > 0.05 fail to reject the null hypothesis.  
    

Stacked variables

You can perform Bartlett’s test with the bartlett.test function. If your data is in stacked form (with the values for both samples stored in one variable), use the command:

    bartlett.test(values~groups, dataset)
    

where values is the name of the variable containing the data values and groups is the name of the variable that specifies which sample each value belongs to.

Non Stacked variables

If your data is in unstacked form (with the samples stored in separate variables) nest the variable names inside the list function as shown below.

    bartlett.test(list(dataset$sample1, dataset$sample2, dataset$sample3))
    

If you are unsure whether your data is in stacked or unstacked form, see the article Stacking a dataset in R for examples of data in both forms.

Bartlett’s test using the PlantGrowth data (stacked)

Consider the PlantGrowth dataset (included with R), which gives the dried weight of three groups of ten batches of plants, where each group of ten batches received a different treatment. The weight variable gives the weight of the batch and the groups variable gives the treatment received (either ctrl, trt1 or trt2). To view more information about the dataset, enter help(PlantGrowth). To view the data, enter the dataset name:

        head(PlantGrowth)
##   weight group
## 1   4.17  ctrl
## 2   5.58  ctrl
## 3   5.18  ctrl
## 4   6.11  ctrl
## 5   4.50  ctrl
## 6   4.61  ctrl

Suppose you want to use Bartlett’s test to determine whether the variance in weight is the same for all treatment groups. A significance level of 0.05 will be used.

To perform the test, use the following command (this gives the output shown below):

        bartlett.test(weight~group, PlantGrowth)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  weight by group
## Bartlett's K-squared = 2.8786, df = 2, p-value = 0.2371

From the output we can see that the p-value of 0.2371 is not less than the significance level of 0.05.

This means we cannot reject the null hypothesis that the variance is the same for all treatment groups.

This means that there is no evidence to suggest that the variance in plant growth is different for the three treatment groups.

Quote Tadeu: Bartlett's test indicates whether the correlation matrix is adequate for a factor analysis, and the p-value should be different from zero (0).
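
Note that the quote refers to Bartlett's test of sphericity, which checks whether the correlation matrix differs from an identity matrix (i.e. whether there is enough correlation among the variables for a factor analysis to make sense); this is not the same test as the homogeneity-of-variances test above. A minimal sketch with psych::cortest.bartlett, assuming the x data from earlier:

    library(psych)
    # A small p-value suggests the correlation matrix is not an identity matrix,
    # i.e. the variables are correlated enough to be factored.
    cortest.bartlett(cor(x), n = nrow(x))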

Bartlett’s test using the PlantGrowth data (non stacked)

        head(PlantGrowth)
##   weight group
## 1   4.17  ctrl
## 2   5.58  ctrl
## 3   5.18  ctrl
## 4   6.11  ctrl
## 5   4.50  ctrl
## 6   4.61  ctrl
        table(PlantGrowth$group)
## 
## ctrl trt1 trt2 
##   10   10   10
        datasetA = subset(PlantGrowth, group=="ctrl")
        datasetB = subset(PlantGrowth, group=="trt1")
        datasetC = subset(PlantGrowth, group=="trt2")
        dataset = data.frame(datasetA[,1],datasetB[,1],datasetC[,1])
        names(dataset) = c("A","B","C")
        dataset
##       A    B    C
## 1  4.17 4.81 6.31
## 2  5.58 4.17 5.12
## 3  5.18 4.41 5.54
## 4  6.11 3.59 5.50
## 5  4.50 5.87 5.37
## 6  4.61 3.83 5.29
## 7  5.17 6.03 4.92
## 8  4.53 4.89 6.15
## 9  5.33 4.32 5.80
## 10 5.14 4.69 5.26
        bartlett.test(dataset)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  dataset
## Bartlett's K-squared = 2.8786, df = 2, p-value = 0.2371


Principal component analysis

Choosing the number of factors

The number of components equals the number of variables.

Proportion of Variance: the proportion of the total variance explained by that component. E.g. if it equals 0.85 then the component explains 85% of the variance.

Cumulative Proportion: accumulates the proportions of the previous components, adding up until it reaches 1.

Standard deviation: the eigenvalue of each component is the square of its standard deviation.

The question here is: how many of these components (factors) do we want to retain?

Choose the number of components (factors) with eigenvalues equal to or higher than 1, using the principal components function.

In the example below it should be 2, because Comp.1 and Comp.2 are the only ones with standard deviations of at least 1 (1.904 and 1.042), i.e. the only ones whose eigenvalues (the squared standard deviations) are at least 1. See the sketch after the summary output for computing the eigenvalues directly.

        pca1 = princomp(x, scores=TRUE, cor=TRUE)
        summary(pca1)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4
## Standard deviation     1.9036937 1.0423367 0.81837919 0.75632747
## Proportion of Variance 0.5177214 0.1552094 0.09567779 0.08171875
## Cumulative Proportion  0.5177214 0.6729308 0.76860854 0.85032729
##                            Comp.5     Comp.6     Comp.7
## Standard deviation     0.64958592 0.56978592 0.54871770
## Proportion of Variance 0.06028027 0.04637943 0.04301302
## Cumulative Proportion  0.91060756 0.95698698 1.00000000
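
The eigenvalues themselves are not printed by summary(); they are the squared standard deviations, so you can check the Kaiser criterion directly (a quick sketch):

    pca1$sdev^2             # eigenvalues of the correlation matrix
    sum(pca1$sdev^2 >= 1)   # how many components the Kaiser criterion retains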

Loadings of principal components (factor loadings)

Step 1 - Analysing Loadings

The loadings from the PCA above allow you to choose which variables go into which components.

To make this choice you should do the following.

  1. Ignore the sign of the loading; use the absolute values.
  2. For each variable, observe in which component its highest loading lies. E.g. in the example below you can see that, of the two components (factors) chosen above, Q17A loads highest on Comp.1, so that is where it should stay; on the other hand, Q17G loads highest on Comp.2, so it stays there.
  3. Decide which variables go into which components. In our example below, Comp.1 (Q17A, Q17B, Q17C, Q17D, Q17E) and Comp.2 (Q17F, Q17G). See the sketch after the loadings output for a way to automate this.

Notice that R uses a printing cut-off of 0.1, so some loadings appear blank.

These numbers indicate how strongly each variable is associated with each component (strictly they are the eigenvector coefficients; multiplied by the component's standard deviation they give the correlation between the component and the variable).

    loadings(pca1)
## 
## Loadings:
##      Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## Q17A -0.416  0.174 -0.465  0.133  0.222  0.613  0.370
## Q17B -0.356  0.347  0.452 -0.631                0.378
## Q17C -0.436  0.212 -0.136 -0.203  0.351        -0.759
## Q17D -0.357         0.647  0.657  0.126              
## Q17E -0.405  0.243 -0.345  0.264 -0.564 -0.512       
## Q17F -0.341 -0.573  0.106 -0.194 -0.549  0.391 -0.229
## Q17G -0.320 -0.644 -0.110         0.437 -0.439  0.290
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143  0.143
## Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857  1.000
    #pca1$loadings
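
As referenced in the list above, a small sketch that automates steps 2 and 3: for each variable, find which of the two retained components carries its highest absolute loading.

    L <- loadings(pca1)[, 1:2]      # keep only the retained components
    apply(abs(L), 1, which.max)     # 1 = Comp.1, 2 = Comp.2 for each variable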

Step 2 - Using rotation to confirm the loadings results

You can confirm the results of the PCA loadings by applying the factanal function to the same data.

It returns the same components, in this case called factors.

Each factor has its own set of variable loadings.

You can run factanal() with the number of factors ranging from the maximum allowed down until you find the first occurrence of a p-value < .05. That should set the number of factors.
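
A sketch of that search, assuming the x data from above (with 7 variables, factanal allows at most 3 factors):

    for (k in 3:1) {
      f <- factanal(x, factors = k)
      cat("factors =", k, " p-value =", f$PVAL, "\n")
    }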

Here we do not talk about what percentage of the variation is explained; we talk about the uniqueness of each variable (its unique variance).

As a result of this command you can also see the factor LOADINGS.

These are similar to the principal component loadings (a slightly different calculation).

Similarly to the PCA loadings, you should put each variable into the factor on which it has its highest loading.

The rotated loadings are stronger evidence than the PCA loadings, so if they disagree you should stick to the rotation results.

    # Using 2 factors according to Methods 1 and 2 

    fa1 = factanal(x, factor=2, rotation="varimax", scores="regression")
    fa1
## 
## Call:
## factanal(x = x, factors = 2, scores = "regression", rotation = "varimax")
## 
## Uniquenesses:
##  Q17A  Q17B  Q17C  Q17D  Q17E  Q17F  Q17G 
## 0.393 0.575 0.305 0.660 0.425 0.565 0.005 
## 
## Loadings:
##      Factor1 Factor2
## Q17A 0.730   0.272  
## Q17B 0.645          
## Q17C 0.789   0.270  
## Q17D 0.524   0.254  
## Q17E 0.735   0.187  
## Q17F 0.315   0.579  
## Q17G 0.149   0.986  
## 
##                Factor1 Factor2
## SS loadings      2.506   1.565
## Proportion Var   0.358   0.224
## Cumulative Var   0.358   0.582
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 29.36 on 8 degrees of freedom.
## The p-value is 0.000274

The example above deals with 2 factors. However, in some cases you may deal with many factors, making it difficult to visually choose which variable goes into which factor.

In those cases you should store the absolute values of the loadings in a CSV file to find which absolute loading value of each variable is the highest among the factors, so you can decide which factor to assign that variable to.

To do so first create the CSV file with the absolute values of the loadings.

    write.csv(abs(fa1$loadings), "loadings.csv")

From now on you should use Excel.

First create a column called MAXIMUM. In this column, store the maximum of the values in each line. Remember that one line of the loadings holds all the loadings of a single variable, and you should keep the highest absolute value among them.

EXCEL COMMAND: =MAX(<range>) (=MÁXIMO in the Portuguese version of Excel)

In doing so you can find which variable belongs to which factor without the burden of visually scanning dozens of possible combinations.

For a smaller number of factors you can easily do this visually, but this is a good strategy otherwise.
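
If you prefer to stay in R rather than switch to Excel, the same maximum-per-row idea can be sketched as:

    A <- abs(fa1$loadings[, 1:2])                  # absolute loadings of the retained factors
    data.frame(maximum = apply(A, 1, max),         # highest absolute loading of each variable
               factor  = apply(A, 1, which.max))   # which factor that maximum belongs to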

Scores of the components

This is a mighty useful result and deserves a good explanation, so let's get to it.

The pca1$scores below returns a table of scores for each component (factor).

This result can be used to express in one value, for each observation, the combined contribution of all the variables to that particular component (factor).

Observe that the calculation of scores is done for all possible components (Comp.1 up to Comp.7).

The point here is that the values in Comp.1 refer to all observations across variables Q17A to Q17G put together.

Similarly, Comp.2 refers to all observations across variables Q17A to Q17G, with one important difference.

Comp.1 expresses more strongly the loadings of variables Q17A to Q17E, and Comp.2 expresses more strongly the loadings of variables Q17F and Q17G. The loadings of Q17F and Q17G on Comp.1 are not expressive, just as the loadings of Q17A to Q17E on Comp.2 are not expressive.

In other words, we can use Comp.1 and Comp.2 in our example as columns in the data set, each representing a single value that expresses the strength of Q17A to Q17E put together and of Q17F and Q17G put together.

From now on we can work with these columns, for example comparing their means.

You can convert the values of the scores to values ranging from 0 to 1 using the following principle: the minimum value corresponds to zero, the maximum value corresponds to 100%, and the others are interpolated accordingly.
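
A sketch of that rescaling for the first component (0 corresponds to the minimum score and 1 to the maximum):

    s   <- pca1$scores[, 1]
    s01 <- (s - min(s)) / (max(s) - min(s))   # rescale to the 0-1 range
    summary(s01)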

However, the scores also generate new variables (the components) that you can use instead of the original variables, so you can look at a specific observation and see its score on one particular component. I still can't see the use of it.

    pca1$scores[1:10,]
##           Comp.1      Comp.2      Comp.3     Comp.4      Comp.5
##  [1,] -1.1264223 -1.01152651 -0.78296004  0.2864407 -0.26385391
##  [2,]  0.4468826 -0.07196384  0.03190024  0.1700661  0.25391424
##  [3,]  2.6800626 -1.93669278  0.63651709 -0.8806210  0.76301457
##  [4,] -1.2513165  1.90617530  1.17988174 -0.2394527  0.73989863
##  [5,]  1.6640059  0.35102689 -0.44206990 -0.3033111 -0.73510756
##  [6,]  0.5063145 -1.16292192 -0.60171945  0.8285846  0.77298345
##  [7,] -2.4132238 -0.21573872 -0.88050838 -0.5149585  0.28587756
##  [8,]  1.7808478  0.06778094  1.43965756  1.7636303 -1.99360486
##  [9,]  0.9105330 -0.29731034  0.17712330  0.3859131 -0.11955035
## [10,] -2.4235099 -0.71102632 -1.52424240  1.9423885 -0.05223305
##            Comp.6      Comp.7
##  [1,]  0.04499774 -0.25766222
##  [2,] -0.02859984 -0.76673093
##  [3,]  0.03417586  0.63706208
##  [4,] -1.19119749 -0.38524923
##  [5,]  0.46370033 -0.24504195
##  [6,] -0.41914730 -0.88391369
##  [7,]  0.48688043 -0.26260782
##  [8,] -0.35092511  1.03224242
##  [9,]  0.06118465  0.04058196
## [10,]  0.20377837 -1.06108664

Cronbach’s alpha

What is Cronbach’s Alpha?
How does it measure the reliability of data?
How to interpret the output?

Used to know whether a dataset can be used in a factor analysis. It measures whether a dataset is consistent.

It identifies, within a factor, the variables that stay and the ones that leave.

It’s better to think of Cronbach's Alpha as measuring the “consistency” of your measure instead of its “reliability”.

It takes your measure and splits it up to come out with a coefficient that tells you how consistent the items in your measure are at measuring a certain construct within the factor (component).

So, for instance, if you have a test that measures math, spelling, and science, it's going to have a low Cronbach's alpha because the test is measuring 3 different things.

But if the items on your test are all measuring the same thing (i.e., how to multiply) then you will have a high alpha.

When interpreting it you also have to take into account how long your test is: all else being equal, a longer test tends to produce a higher alpha.

So if you have an alpha of .85, then you can say that your test is 85% consistent/reliable in measuring the same construct.

1. Using Package “fmsb”

    # This is a straightforward function
    QUEST <- data.frame(
      Q1 = c(1,5,2,3,4,2,3,4,3,2),
      Q2 = c(2,4,1,2,4,1,2,5,2,1),
      Q3 = c(2,5,1,3,3,2,2,4,2,2))

    #install.packages("fmsb")
    library(fmsb)
    CronbachAlpha(QUEST)
## [1] 0.9295039

2. Using Package “psych”

    QUEST <- data.frame(
      Q1 = c(1,5,2,3,4,2,3,4,3,2),
      Q2 = c(2,4,1,2,4,1,2,5,2,1),
      Q3 = c(2,5,1,3,3,2,2,4,2,2))
    #install.packages("psych")
    library(psych)
    alpha(QUEST)
## 
## Reliability analysis   
## Call: alpha(x = QUEST)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N  ase mean  sd
##       0.93      0.93    0.91      0.83  14 0.22  2.6 1.2
## 
##  lower alpha upper     95% confidence boundaries
## 0.5 0.93 1.36 
## 
##  Reliability if an item is dropped:
##    raw_alpha std.alpha G6(smc) average_r  S/N alpha se
## Q1      0.90      0.91    0.83      0.83 10.1     0.37
## Q2      0.91      0.91    0.84      0.84 10.4     0.37
## Q3      0.88      0.89    0.80      0.80  8.2     0.38
## 
##  Item statistics 
##     n raw.r std.r r.cor r.drop mean  sd
## Q1 10  0.93  0.94  0.89   0.86  2.9 1.2
## Q2 10  0.94  0.94  0.88   0.85  2.4 1.4
## Q3 10  0.94  0.95  0.91   0.88  2.6 1.2
## 
## Non missing response frequency for each item
##      1   2   3   4   5 miss
## Q1 0.1 0.3 0.3 0.2 0.1    0
## Q2 0.3 0.4 0.0 0.2 0.1    0
## Q3 0.1 0.5 0.2 0.1 0.1    0

NO USE TO THIS UP TO HERE

5. Scree plot of eigenvalues

Plots the components on the x axis versus their eigenvalues on the y axis.
You can visually inspect how many components (factors) to use by choosing those above 1.

    plot(pca1)

    screeplot(pca1, type="line", main="Scree Plot") 

6. Biplot of score variables

    biplot(pca1) # shows the observations (scores) and the variables (loadings) together on the first two components

8. Rotation

Rotation re-expresses the retained components so that the loadings become easier to interpret: varimax keeps the rotated components uncorrelated, while promax allows them to be correlated. Note that princomp stores the loadings in pca1$loadings (pca1$rotation only exists for prcomp objects).

    #varimax(pca1$loadings[,1:2])
    #promax(pca1$loadings[,1:2])