Exploratory Factor Analysis
Required packages: psych and nFactors
For more info: http://www.statmethods.net/advstats/factor.html
To omit observations with missing values, create a new data set without them:
Generally, factor analysis (and principal components analysis) does not play well with missing data. You will have to remove cases that have missing data.
Beware, however, of doing that before selecting your variables of interest from the data set. If you leave all variables in the data set when you remove missing cases, you will frequently discover that a particularly problematic variable (one with a lot of missings) causes you to cull many observations that you could otherwise have used.
The code for removing missing cases is straightforward, but it is a blunt force tool.
If, in the course of doing a factor analysis, you decide to discard one or more variables, then you are strongly advised to go back, subset the original data set again, and remove the missings from that subset before continuing.
To remove missings, use: dat <- na.omit(my.Original.data) # note: I rename the data set each time to save work later.
Also, since these examples use the hypothetical data set “mydata”, you may want to go ahead and rename your data set “mydata” to make this really easy.
mydata <- dat
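For example, here is a minimal sketch of the subset-first workflow described above (the variable names are hypothetical placeholders; substitute your own variables of interest):
vars.of.interest <- c("var1", "var2", "var3") # hypothetical variable names
dat <- na.omit(my.Original.data[ , vars.of.interest]) # subset first, then drop incomplete cases
mydata <- dat # rename to match the examples below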
Also, keep in mind that factor analysis is designed for continuous data. Although it is possible to include categorical data in a factor analysis, the procedures below will not allow that. Keep this in mind when subsetting data.
As with any analysis, begin by evaluating the data you have.
The following evaluations are important. Provided you pay attention to what they tell you, they will frequently save you a huge amount of grief. There is little reward in getting to the middle or end of an analysis and discovering that your data were not appropriate for the tool you are applying.
Try each of the following analyses to familiarize yourself with the data and evaluate what you may need to modify before applying a factor analysis. Later tools are designed to tell you, in advance, whether factor analysis will render a satisfactory set of conclusions in your data.
Think of these as time-savers, even though they are a time investment themselves. The iterative nature of factor analysis can easily result in an exercise in chasing rabbits for a while before giving up due to lack of convergence.
Start by examining the variables you are analyzing.
If nothing else, look for the number of missings. You can, however, also see whether each variable was read in correctly as a continuous variable.
You may need to clean up problematic variables before going any further.
summary(mydata)
At the very least, identify which variables have the most missing values. They will be the most problematic, as they will force you to reduce the number of observations in the data most.
For the sake of having some output to show, I am using the "UN.2014.rda" data from the course website. As you will see, these data are not ideal for factor analysis. But, they do give us the opportunity to see what it is like to clean data and get it ready for this procedure.
colSums(is.na(mydata))
## life school ExSchool GNI
## 4 8 5 5
## CHI IneqLiEx IneqLiEx.index InEd
## 50 10 10 42
## InEd.index InIN InIN.index QuintRat
## 43 44 44 78
## PalmaRat Gini GenderIneq MatMort
## 91 57 43 15
## AdBR FemPar GenderHDI ImmDTP
## 10 5 47 2
## ImmMeasles UndFiveMort Antenatal SecondEd
## 2 2 32 33
## PrePrimary Primary Secondary Tertiary
## 21 14 18 30
## dropout PupTeaRat EdGDP GDP
## 40 20 37 15
## GDPPC CPI FoodPI PriceVolitility
## 15 15 68 63
## EmpPopRat VulnerPop YouthEmp Unemployment
## 23 106 82 46
## ChildLabor WorkingPoor MaternityLeave BirthReg
## 89 99 61 34
## Pension refugees IntTrade FDI
## 22 5 22 11
## PrivateCapFlow DevAsst Remittances reserves
## 41 32 38 64
## CO2 NatResDep ForestArea FreshWater
## 6 28 6 24
## PopOnBadLand NatDisasters Pop YoungPop
## 49 17 0 10
## OldPop UrbanPop MedianAge DependencyYoung
## 10 0 10 10
## DepencencyOld SexRatio FemaleEd MaleEd
## 10 10 34 34
## FeEcPart MaEcPart FemHDI MaleHDI
## 18 18 47 47
## InMortRate HIVFemale HIVMale HIVTotal
## 2 94 94 94
## FemaleMort MaleMort Obesity HealthExp
## 3 3 7 7
## Literacy GFC Credit ExternalDebt
## 49 23 17 77
## WorkingPoor.1 LongUnemploy Homeless PrisonPop
## 99 118 65 5
## Homocide NetMigration Tourists Internet
## 18 10 10 5
## IncomingIntPhone OutgoingIntPhone FuelSupply MDPovertyIndex
## 80 62 60 104
## IntenMDPov PopNearPoverty PopSeverePoverty SuperPoor
## 104 104 104 120
## PoorIn Fertility
## 104 10
missings <- colSums(is.na(mydata)) # Count # missing in each column
summary(missings) # Evaluate the breakdown of missings
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 26.00 37.77 59.25 120.00
# missings # Alternatively, just look at them
mydata <- mydata[ , missings<40] # Keep only columns with fewer than 40 missings
If there are missings, you will have to eliminate them. You may also choose to eliminate particularly problematic variables (those with many missings) in the interest of maximizing your sample size.
To see the dimensions of the data set, use dim(mydata).
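For example, after dropping the high-missing columns above, a quick sketch for removing the remaining incomplete rows and checking how many observations survive:
mydata <- na.omit(mydata) # drop rows that still contain missing values
dim(mydata) # rows remaining x columns remaining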
library(RcmdrMisc)
rcorr.adjust(mydata) # This function is built into R Commander.
## If you want to run this before eliminating missing values use:
# rcorr.adjust(mydata, use="pairwise.complete.obs")
write.csv(cor(mydata)>0.8, file="Suspect_Correlations.csv")
write.csv(cor(mydata), file="Correlation_Values.csv")
# Use the CSV file to find variables that are too highly correlated (r>0.90)
# and select one variable from each pair to omit from the data set.
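# A programmatic alternative (a sketch): list the variable pairs with |r| > 0.90
# directly in R instead of scanning the CSV files by hand.
r <- cor(mydata, use = "pairwise.complete.obs")
high <- which(abs(r) > 0.90 & upper.tri(r), arr.ind = TRUE)
data.frame(var1 = rownames(r)[high[ , 1]], var2 = colnames(r)[high[ , 2]], r = round(r[high], 2))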
overcorrelated<- c("SecondEd", "OldPop", "FemaleEd", "school",
"IneqLiEx", "UndFiveMort", "YoungPop",
"MaleEd", "InMortRate", "IneqLiEx.index",
"GNI", "DepencencyOld")
mydata <- mydata[ ,!(names(mydata) %in% overcorrelated)] # Remove listed variables from data
Please note that outliers should usually be evaluated carefully and the idea of what is “representative” of a population is something that you should think through before launching into a frenzy of outlier elimination. That is, if there are some observations that seem to be influencing the outcome of your analysis, ask yourself whether those observations are really just realistic reflections of an influential aspect of the society you are analyzing. If they are an important and active part of the population, then they should likely remain in the model.
If, however, there are several observations that are more likely fringe or not reflective of the population at large, then you may choose to remove them from the model. In those cases, consider retaining those observations for a qualitative follow-up to investigate what makes them stand out and why. Fruitful studies can result from examining the “sports” or exceptions from the rule in a population.
A simple, though somewhat time consuming, method for evaluating whether outliers are present is to use visualizations. Although the visualization can be appealing and help you to get to know the data, you may have to look at the variables in chunks, unless they are all measured on the same scale.
You will need to follow up with each variable where you see outliers (points that exist outside the tails of the boxplots) in order to carefully assess and eliminate outlier observations. This means spending extra time and evaluating variables one-by-one. It also implies that you do not really know whether the observation, as a whole (including all variables you are using) stands out as an outlier.
boxplot(mydata[,18:30])
You are already familiar with at least one univariate way of measuring distance. Converting a variable to the z-scale provides a good idea of which observations stand out as different from the rest on a particular variable. That is because z-scores compare each individual observation to what is average (i.e., the variable’s mean) for the variable. Those comparisons are made in light of how much variation exists, on average, within the variable (the standard deviation).
The equation used to rescale variables into z-scores is as follows:
\[ z_i=\frac{x_i-\bar{x}}{s} \]
Z-scores, therefore, measure how different each observation is from the average, measured in terms of standard deviations. So, an observation with a z-score of 3 is three standard deviations away from the mean.
If you are looking for a benchmark for judging z-scores, consider that an absolute z-score of 2.58 or greater places an observation in the most extreme 1% of a normal distribution. For me, an absolute value (\(\lvert z_i \rvert\)) of three or greater is a decent cutoff. But, follow your own discretion.
To convert the data set to z-scores, use…
scale(mydata)
boxplot(scale(mydata))
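If you want to flag specific cases rather than eyeball the plot, here is a short sketch (the cutoff of 3 follows the benchmark discussed above):
z <- scale(mydata) # convert every column to z-scores
which(abs(z) >= 3, arr.ind = TRUE) # row and column of each value with |z| >= 3
rowSums(abs(z) >= 3, na.rm = TRUE) # how many variables flag each observation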
You will have to evaluate each variable separately, and plotting may help. But, this is time consuming and does not take into account whether the observation, as it is represented by a collection of variables, stands out as an outlier.
If you wish to know whether the observation, as a collection of variables, stands out as being exceptionally different from the others, then you need to take a multivariate approach to outlier identification. You saw something like this when we used Cook’s distance to evaluate whether individual observations stood out in a regression as being exceptionally different from the others. Those differences can sometimes bias the calculations and make the output less reflective of the core of the population.
One of the more commonly used evaluations that takes multiple variables into account at once is Mahalanobis distance. In very general terms, Mahalanobis distance treats the observations as existing in n-dimensional space. The various variables provide coordinates in that space. This allows for a measure of difference that is similar to z, in that it calculates the distance of each observation from the centroid of the sample.
Rather than confuse you further with vague analogies, consider instead this succinct explanation of Mahalanobis distance from Data Camp.
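If you want to compute the distances yourself, base R’s mahalanobis() function will do it. A minimal sketch, assuming complete cases, a nonsingular covariance matrix, and a chi-square cutoff for flagging:
cc <- na.omit(mydata) # mahalanobis() needs complete cases
md <- mahalanobis(cc, colMeans(cc), cov(cc)) # squared distance of each row from the centroid
cutoff <- qchisq(0.975, df = ncol(cc)) # 97.5th percentile of the chi-square distribution
which(md > cutoff) # rows flagged as potential multivariate outliers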
You can find some algorithms for outlier detection using Mahalanobis’ distance in the mvoutlier package.
# install.packages("mvoutlier", dependencies=TRUE)
library(mvoutlier)
aq.plot(mydata, quan=0.75)$outliers
## Warning in covMcd(x, alpha = quan): n < 2 * p, i.e., possibly too small sample
## size
## Projection to the first and second robust principal components.
## Proportion of total variation (explained variance): 0.9490672
## 6 11 12 20 24 25 26 38 42 50 53 65 70
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 71 74 76 82 83 85 87 89 94 98 99 102 103
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 105 108 109 110 111 113 114 115 117 125 126 129 132
## TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 134 135 136 138 139 142 145 146 147 151 154 160 162
## FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 163 164 165 166 172 173 176 180 181 186 187
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
# To see how many outliers were detected using this approach, use
# sum(aq.plot(mydata, quan=0.75)$outliers)
aq.plot(mydata, quan=1)$outliers
## Warning in covMcd(x, alpha = quan): n < 2 * p, i.e., possibly too small sample
## size
## Projection to the first and second robust principal components.
## Proportion of total variation (explained variance): 0.9209626
## 6 11 12 20 24 25 26 38 42 50 53 65 70
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 71 74 76 82 83 85 87 89 94 98 99 102 103
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 105 108 109 110 111 113 114 115 117 125 126 129 132
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 134 135 136 138 139 142 145 146 147 151 154 160 162
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 163 164 165 166 172 173 176 180 181 186 187
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# To see how many outliers were detected using this approach, use
# sum(aq.plot(mydata, quan=1)$outliers)
dd.plot(mydata)$outliers
## Warning in covMcd(x, alpha = quan): n < 2 * p, i.e., possibly too small sample
## size
## 6 11 12 20 24 25 26 38 42 50 53 65 70
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 71 74 76 82 83 85 87 89 94 98 99 102 103
## FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 105 108 109 110 111 113 114 115 117 125 126 129 132
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## 134 135 136 138 139 142 145 146 147 151 154 160 162
## FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 163 164 165 166 172 173 176 180 181 186 187
## FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
# To see how many outliers were detected using this approach, use
# sum(dd.plot(mydata)$outliers)
Before fitting the factor model, check the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy, which tells you whether the variables share enough common variance to be worth factoring. Kaiser provided the following values for interpreting the overall measure of sampling adequacy (MSA):
* 0.00 to 0.49 unacceptable
* 0.50 to 0.59 miserable
* 0.60 to 0.69 mediocre
* 0.70 to 0.79 middling
* 0.80 to 0.89 meritorious
* 0.90 to 1.00 marvelous
# install.packages("psych", dependencies = TRUE)
library(psych)
KMO(mydata)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = mydata)
## Overall MSA = 0.59
## MSA for each item =
## life ExSchool MatMort AdBR FemPar
## 0.86 0.74 0.78 0.70 0.17
## ImmDTP ImmMeasles Antenatal PrePrimary Primary
## 0.70 0.55 0.44 0.61 0.14
## Secondary Tertiary PupTeaRat EdGDP GDP
## 0.79 0.73 0.75 0.28 0.45
## GDPPC CPI EmpPopRat BirthReg Pension
## 0.66 0.34 0.57 0.67 0.67
## refugees IntTrade FDI DevAsst Remittances
## 0.38 0.35 0.16 0.63 0.11
## CO2 NatResDep ForestArea FreshWater NatDisasters
## 0.77 0.19 0.28 0.35 0.25
## Pop UrbanPop MedianAge DependencyYoung SexRatio
## 0.28 0.58 0.67 0.67 0.60
## FeEcPart MaEcPart FemaleMort MaleMort Obesity
## 0.46 0.55 0.65 0.69 0.80
## HealthExp GFC Credit PrisonPop Homocide
## 0.57 0.15 0.66 0.30 0.30
## NetMigration Tourists Internet Fertility
## 0.30 0.50 0.82 0.89
The overall MSA of 0.59 falls in Kaiser’s “miserable” range, which is not good enough. I am going to deal with that by eliminating all low-contribution variables.
mydat <- mydata[, KMO(mydata)$MSAi>0.50] # Get rid of all variables with MSA < 0.50
mydata <- mydat
Let’s take a look at the KMO value now that those problematic variables have been removed.
round( KMO(mydata)$MSA, 2 )
## [1] 0.89
That is much better. These data look more suitable for running a factor model.
Please keep in mind that I do not fully condone this approach. I am just doing it out of expediency at the moment and for the sake of having something to demonstrate output below.
Bartlett’s test compares the correlation matrix to an identity matrix (a matrix with ones on the diagonal and zeroes everywhere else, which is what the correlation matrix would look like if the variables were completely uncorrelated). The null hypothesis for Bartlett’s test is:
\[ H_0: \mathbf{R} = \mathbf{I} \] where \(\mathbf{R}\) is the correlation matrix and \(\mathbf{I}\) is the identity matrix. That is to say, Bartlett’s test lets you rule out that the variables in the data set are essentially uncorrelated. If you fail to reject the null for this test, then you may as well stop. There is nothing in there for you to factor; the variables are all essentially unrelated to one another.
library(psych)
cortest.bartlett(mydata)
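If you only need the test statistic and p-value for your write-up, you can pull them out of the returned list (a sketch):
bart <- cortest.bartlett(mydata) # when given raw data, the correlation matrix is computed for you
bart$chisq; bart$df; bart$p.value # chi-square statistic, degrees of freedom, and p-value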
This brings us through steps 2 to 5.
Theoretically, you could have as many factors as you do variables. But that, of course, defeats the purpose of running a factor analysis.
Rather, we are seeking latent structure within a set of data. Therefore, we will only be interested in factors that explain a substantial proportion of variation within the data.
We have a few ways of doing that. But, they generally all have something to do with the scree plot like the one that we develop below.
# install.packages("nFactors", dependencies = TRUE)
library(nFactors)
## Loading required package: lattice
##
## Attaching package: 'nFactors'
## The following object is masked from 'package:lattice':
##
## parallel
ev <- eigen(cor(mydata)) # get eigenvalues
ev$values
## [1] 15.89088158 2.09852113 1.82657814 1.13201882 1.02702103 0.89445313
## [7] 0.65379609 0.58186191 0.52961144 0.42492476 0.40935656 0.36205490
## [13] 0.35611955 0.28487848 0.26191759 0.22357035 0.20112354 0.17531559
## [19] 0.15282141 0.13482767 0.09161749 0.08034115 0.06528069 0.04150550
## [25] 0.03778802 0.02947327 0.01750228 0.01483793
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(ev$values, ap$eigen$qevpea)
plotnScree(nS)
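To read off the number of factors suggested by each retention rule (optimal coordinates, acceleration factor, parallel analysis, and the Kaiser rule) instead of judging the plot by eye, you can inspect the nScree object directly (a sketch):
nS$Components # number of factors recommended by each rule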
Ten factors in a data set this small do not bode well. But, let’s take a look at where this leads us in any case.
Now that you have at least some idea of how many factors to extract, you can get started. But, you still have another important decision to make. You need to consider how you plan to rotate the factors.
Think of factor rotation as being similar to focusing a telescope. You are essentially searching for a clearer association between individual factors and the various variables. The way you do this will depend on your assumption about whether the factors should be correlated with one another.
Under ideal circumstances, we would hope that the various factors are independent and, therefore, uncorrelated. But that is not always realistic. For that reason, some analysts suggest starting out with the assumption that the factors are non-independent by using an oblique rotation method. Oblique rotations will provide a correlation matrix that describes the relationships between the factors. Use that correlation matrix to evaluate whether the correlations are worth keeping (e.g., |r| > 0.30). Alternatively, if the correlations are fairly low (e.g., |r| < 0.30), then you could be justified in switching to the assumption that the factors are independent (orthogonal) and should not be correlated.
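The commands below assume a fitted factor model stored in an object called fit. Here is a sketch of one way to produce such an object with the psych package; the choice of four factors and the promax (oblique) rotation are assumptions based on the scree plot and the discussion above, not a prescription:
library(psych)
fit <- fa(mydata, nfactors = 4, rotate = "promax", fm = "ml") # ML factor analysis with an oblique rotation (assumed settings)
print(fit$loadings, cutoff = 0.3) # suppress small loadings for readability
round(fit$Phi, 2) # factor correlation matrix from the oblique rotation
With an oblique rotation, fit$Phi is the factor correlation matrix described above; if all of its off-diagonal values are below |0.30|, that is your cue to consider an orthogonal rotation instead.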
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7)
You can also visualize the factor model to ease your interpretation.
library(psych)
loads <- fit$loadings
fa.diagram(loads)
To simplify the interpretation further, cut down the number of factors you are considering.
loads2 <- fit$loadings[,1:2]
fa.diagram(loads2)
loads3 <- fit$loadings[,3:4]
fa.diagram(loads3)
From the work above, you will need to consider whether the rotation you selected was appropriate.
Do the factors look stable and distinct?
How good are your factors?
First, specify the variables in your factor.
f1 <- mydata[ , c("ExSchool", "MatMort", "AdBR", "PrePrimary",
"Secondary", "Tertiary", "PupTeaRat",
"EmpPopRat", "BirthReg", "Pension", "MedianAge",
"DependencyYoung", "MaEcPart", "Obesity", "Fertility")]
f2 <- mydata[ , c("Fertility", "life", "FemaleMort", "MaleMort")]
f3 <- mydata[ , c("GDPPC", "CO2", "Internet", "HealthExp")]
f4 <- mydata[ , c("ImmDTP", "ImmMeasles")]
Next, run Cronbach’s alpha to evaluate the internal consistency of each factor. You are looking for \(\alpha>0.80\).
library(psych)
alpha(f1)$total[1]
alpha(f2)$total[1]
alpha(f3)$total[1]
alpha(f4)$total[1]
## raw_alpha
## 0.7752715
## raw_alpha
## 0.6871495
## raw_alpha
## 0.006105121
## raw_alpha
## 0.8437348
Okay. Factors two and three fall short of the 0.80 benchmark (factor three badly so), and factor one only just misses it. We will have to go back and reevaluate what we have done.
Once we have our final, refined factor representation, it is time to write up the results. Be sure to include all details on how the analysis was run (number of factors, how they were chosen, type of rotation, etc.). Also include the factor scores and the correlation matrix, if you used an oblique rotation. Last, be sure to interpret factors and discuss implications.
So that’s it.
Easy, right?
Good luck!