Exploratory Factor Analysis
Required packages: psych and nFactors
For more info: http://www.statmethods.net/advstats/factor.html
To omit observations with missing values, create a new data set without them:
Generally, factor analysis (and principal components analysis) does not play well with missing data. You will have to remove cases that have missing data.
Beware, however, of doing that before selecting your variables of interest from the data set. If you leave all variables in the data set when you remove missing cases, you will frequently discover that a particularly problematic variable (one with a lot of missings) causes you to cull many observations that you could otherwise have used.
The code for removing missing cases is straightforward, but it is a blunt force tool.
If, in the course of doing a factor analysis, you decide to discard one or more variables, then you are strongly advised to go back, subset the original data set again, and remove the missings from that subset before continuing.
To remove missings, use: dat <- na.omit(my.Original.data) # note: I rename the data set each time to save work later.
Also, since these examples use the hypothetical data set “mydata”, you may want to go ahead and rename your data set “mydata” to make this really easy.
mydata <- dat
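For example, here is a minimal sketch of the subset-first workflow described above (the variable names are hypothetical placeholders; substitute your own variables of interest):
vars.of.interest <- c("var1", "var2", "var3") # hypothetical variable names
dat <- na.omit(my.Original.data[ , vars.of.interest]) # subset first, then drop incomplete cases
mydata <- dat # rename to match the examples below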
Also, keep in mind that factor analysis is designed for continuous data. Although it is possible to include categorical data in a factor analysis, the procedures below will not allow that. Keep this in mind when subsetting data.
As with any analysis, begin by evaluating the data you have.
The following evaluations are important. Provided you pay attention to what they tell you, they will frequently save you a huge amount of grief. There is little reward in getting to the middle or end of an analysis and discovering that your data were not appropriate for the tool you are applying.
Try each of the following analyses to familiarize yourself with the data and evaluate what you may need to modify before applying a factor analysis. Later tools are designed to tell you, in advance, whether factor analysis will render a satisfactory set of conclusions in your data.
Think of these as time-savers, even though they are a time investment themselves. The iterative nature of factor analysis can easily result in an exercise in chasing rabbits for a while before giving up due to lack of convergence.
Start by examining the variables you are analyzing.
If nothing else, look for the number of missings. You can, however, also see whether each variable was read in correctly as a continuous variable.
You may need to clean up problematic variables before going any further.
summary(mydata)
At the very least, identify which variables have the most missing values. They will be the most problematic, as they will force you to reduce the number of observations in the data most.
For the sake of having some output to show, I am using the "UN.2014.rda" data from the course website. As you will see, these data are not ideal for factor analysis. But, they do give us the opportunity to see what it is like to clean data and get it ready for this procedure.
colSums(is.na(mydata))
## life school ExSchool GNI
## 4 8 5 5
## CHI IneqLiEx IneqLiEx.index InEd
## 50 10 10 42
## InEd.index InIN InIN.index QuintRat
## 43 44 44 78
## PalmaRat Gini GenderIneq MatMort
## 91 57 43 15
## AdBR FemPar GenderHDI ImmDTP
## 10 5 47 2
## ImmMeasles UndFiveMort Antenatal SecondEd
## 2 2 32 33
## PrePrimary Primary Secondary Tertiary
## 21 14 18 30
## dropout PupTeaRat EdGDP GDP
## 40 20 37 15
## GDPPC CPI FoodPI PriceVolitility
## 15 15 68 63
## EmpPopRat VulnerPop YouthEmp Unemployment
## 23 106 82 46
## ChildLabor WorkingPoor MaternityLeave BirthReg
## 89 99 61 34
## Pension refugees IntTrade FDI
## 22 5 22 11
## PrivateCapFlow DevAsst Remittances reserves
## 41 32 38 64
## CO2 NatResDep ForestArea FreshWater
## 6 28 6 24
## PopOnBadLand NatDisasters Pop YoungPop
## 49 17 0 10
## OldPop UrbanPop MedianAge DependencyYoung
## 10 0 10 10
## DepencencyOld SexRatio FemaleEd MaleEd
## 10 10 34 34
## FeEcPart MaEcPart FemHDI MaleHDI
## 18 18 47 47
## InMortRate HIVFemale HIVMale HIVTotal
## 2 94 94 94
## FemaleMort MaleMort Obesity HealthExp
## 3 3 7 7
## Literacy GFC Credit ExternalDebt
## 49 23 17 77
## WorkingPoor.1 LongUnemploy Homeless PrisonPop
## 99 118 65 5
## Homocide NetMigration Tourists Internet
## 18 10 10 5
## IncomingIntPhone OutgoingIntPhone FuelSupply MDPovertyIndex
## 80 62 60 104
## IntenMDPov PopNearPoverty PopSeverePoverty SuperPoor
## 104 104 104 120
## PoorIn Fertility
## 104 10
missings <- colSums(is.na(mydata)) # Count # missing in each column
summary(missings) # Evaluate the breakdown of missings
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 26.00 37.77 59.25 120.00
# missings # Alternatively, just look at them
mydata <- mydata[ , missings<40] # Keep only columns with fewer than 40 missings
If there are missings, you will have to eliminate them. You may also choose to eliminate particularly problematic variables (those with many missings) in the interest of maximizing your sample size.
To see the dimensions of the data set, use dim(mydata).
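For example, after dropping the high-missing columns above, a quick sketch for removing the remaining incomplete rows and checking how many observations survive:
mydata <- na.omit(mydata) # drop rows that still contain missing values
dim(mydata) # rows remaining x columns remaining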
library(RcmdrMisc)
rcorr.adjust(mydata) # This function is built into R Commander.
## If you want to run this before eliminating missing values use:
# rcorr.adjust(mydata, use="pairwise.complete.obs")
write.csv(cor(mydata)>0.8, file="Suspect_Correlations.csv")
write.csv(cor(mydata), file="Correlation_Values.csv")
# Use the CSV file to find variables that are too highly correlated (r>0.90)
# and select one variable from each pair to omit from the data set.
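# A programmatic alternative (a sketch): list the variable pairs with |r| > 0.90
# directly in R instead of scanning the CSV files by hand.
r <- cor(mydata, use = "pairwise.complete.obs")
high <- which(abs(r) > 0.90 & upper.tri(r), arr.ind = TRUE)
data.frame(var1 = rownames(r)[high[ , 1]], var2 = colnames(r)[high[ , 2]], r = round(r[high], 2))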
overcorrelated<- c("SecondEd", "OldPop", "FemaleEd", "school",
"IneqLiEx", "UndFiveMort", "YoungPop",
"MaleEd", "InMortRate", "IneqLiEx.index",
"GNI", "DepencencyOld")
mydata <- mydata[ ,!(names(mydata) %in% overcorrelated)] # Remove listed variables from data
Please note that outliers should usually be evaluated carefully and the idea of what is “representative” of a population is something that you should think through before launching into a frenzy of outlier elimination. That is, if there are some observations that seem to be influencing the outcome of your analysis, ask yourself whether those observations are really just realistic reflections of an influential aspect of the society you are analyzing. If they are an important and active part of the population, then they should likely remain in the model.
If, however, there are several observations that are more likely fringe or not reflective of the population at large, then you may choose to remove them from the model. In those cases, consider retaining those observations for a qualitative follow-up to investigate what makes them stand out and why. Fruitful studies can result from examining the “sports” or exceptions from the rule in a population.
A simple, though somewhat time consuming, method for evaluating whether outliers are present is to use visualizations. Although the visualization can be appealing and help you to get to know the data, you may have to look at the variables in chunks, unless they are all measured on the same scale.
You will need to follow up with each variable where you see outliers (points that exist outside the tails of the boxplots) in order to carefully assess and eliminate outlier observations. This means spending extra time and evaluating variables one-by-one. It also implies that you do not really know whether the observation, as a whole (including all variables you are using) stands out as an outlier.
boxplot(mydata[,18:30])
You are already familiar with at least one univariate way of measuring distance. Converting a variable to the z-scale provides a good idea of which observations stand out as different from the rest on a particular variable. That is because z-scores compare each individual observation to what is average (i.e., the variable’s mean) for the variable. Those comparisons are made in light of how much variation exists, on average, within the variable (the standard deviation).
The equation used to rescale variables into z-scores is as follows:
\[ z_i=\frac{x_i-\bar{x}}{s} \]
Z-scores, therefore, measure how different each observation is from the average, measured in terms of standard deviations. So, an observation with a z-score of 3 is three standard deviations away from the mean.
If you are looking for a benchmark for judging z-scores, consider that an absolute z-score of 2.58 or greater places an observation in the most extreme 1% of a normal distribution. For me, an absolute value (\(\lvert z_i \rvert\)) of three or greater is a decent cutoff. But, follow your own discretion.
To convert the data set to z-scores, use…
scale(mydata)
boxplot(scale(mydata))
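If you want to flag specific cases rather than eyeball the plot, here is a short sketch (the cutoff of 3 follows the benchmark discussed above):
z <- scale(mydata) # convert every column to z-scores
which(abs(z) >= 3, arr.ind = TRUE) # row and column of each value with |z| >= 3
rowSums(abs(z) >= 3, na.rm = TRUE) # how many variables flag each observation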
You will have to evaluate each variable separately, and plotting may help. But, this is time consuming and does not take into account whether the observation, as it is represented by a collection of variables, stands out as an outlier.
If you wish to know whether the observation, as a collection of variables, stands out as being exceptionally different from the others, then you need to take a multivariate approach to outlier identification. You saw something like this when we used Cook’s distance to evaluate whether individual observations stood out in a regression as being exceptionally different from the others. Those differences can sometimes bias the calculations and make the output less reflective of the core of the population.
One of the more commonly used evaluations that takes multiple variables into account at once is Mahalanobis distance. In very general terms, Mahalanobis distance treats the observations as existing in n-dimensional space. The various variables provide coordinates in that space. This allows for a measure of difference that is similar to z, in that it calculates the distance of each observation from the centroid of the sample.
Rather than confuse you further with vague analogies, consider instead this succinct explanation of Mahalanobis distance from Data Camp.
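If you want to compute the distances yourself, base R’s mahalanobis() function will do it. A minimal sketch, assuming complete cases, a nonsingular covariance matrix, and a chi-square cutoff for flagging:
cc <- na.omit(mydata) # mahalanobis() needs complete cases
md <- mahalanobis(cc, colMeans(cc), cov(cc)) # squared distance of each row from the centroid
cutoff <- qchisq(0.975, df = ncol(cc)) # 97.5th percentile of the chi-square distribution
which(md > cutoff) # rows flagged as potential multivariate outliers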
You can find some algorithms for outlier detection using Mahalanobis’ distance in the mvoutlier package.
# install.packages("mvoutlier", dependencies=TRUE)
library(mvoutlier)
aq.plot(mydata, quan=0.75)$outliers
## Warning in covMcd(x, alpha = quan): n < 2 * p, i.e., possibly too small sample
## size
## Projection to the first and second robust principal components.
## Proportion of total variation (explained variance): 0.9490672
## 6 11 12 20 24 25 26 38 42 50 53 65 70
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 71 74 76 82 83 85 87 89 94 98 99 102 103
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 105 108 109 110 111 113 114 115 117 125 126 129 132
## TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 134 135 136 138 139 142 145 146 147 151 154 160 162
## FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 163 164 165 166 172 173 176 180 181 186 187
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
# To see how many outliers were detected using this approach, use
# sum(aq.plot(mydata, quan=0.75)$outliers)
aq.plot(mydata, quan=1)$outliers
## Warning in covMcd(x, alpha = quan): n < 2 * p, i.e., possibly too small sample
## size
## Projection to the first and second robust principal components.
## Proportion of total variation (explained variance): 0.9209626
## 6 11 12 20 24 25 26 38 42 50 53 65 70
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 71 74 76 82 83 85 87 89 94 98 99 102 103
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 105 108 109 110 111 113 114 115 117 125 126 129 132
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 134 135 136 138 139 142 145 146 147 151 154 160 162
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 163 164 165 166 172 173 176 180 181 186 187
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# To see how many outliers were detected using this approach, use
# sum(aq.plot(mydata, quan=1)$outliers)
dd.plot(mydata)$outliers
## Warning in covMcd(x, alpha = quan): n < 2 * p, i.e., possibly too small sample
## size
## 6 11 12 20 24 25 26 38 42 50 53 65 70
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 71 74 76 82 83 85 87 89 94 98 99 102 103
## FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 105 108 109 110 111 113 114 115 117 125 126 129 132
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## 134 135 136 138 139 142 145 146 147 151 154 160 162
## FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 163 164 165 166 172 173 176 180 181 186 187
## FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
# To see how many outliers were detected using this approach, use
# sum(dd.plot(mydata)$outliers)
Before fitting the factor model, check the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy, which tells you whether the variables share enough common variance to be worth factoring. Kaiser provided the following values for interpreting the overall measure of sampling adequacy (MSA):
* 0.00 to 0.49 unacceptable
* 0.50 to 0.59 miserable
* 0.60 to 0.69 mediocre
* 0.70 to 0.79 middling
* 0.80 to 0.89 meritorious
* 0.90 to 1.00 marvelous
# install.packages("psych", dependencies = TRUE)
library(psych)
KMO(mydata)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = mydata)
## Overall MSA = 0.59
## MSA for each item =
## life ExSchool MatMort AdBR FemPar
## 0.86 0.74 0.78 0.70 0.17
## ImmDTP ImmMeasles Antenatal PrePrimary Primary
## 0.70 0.55 0.44 0.61 0.14
## Secondary Tertiary PupTeaRat EdGDP GDP
## 0.79 0.73 0.75 0.28 0.45
## GDPPC CPI EmpPopRat BirthReg Pension
## 0.66 0.34 0.57 0.67 0.67
## refugees IntTrade FDI DevAsst Remittances
## 0.38 0.35 0.16 0.63 0.11
## CO2 NatResDep ForestArea FreshWater NatDisasters
## 0.77 0.19 0.28 0.35 0.25
## Pop UrbanPop MedianAge DependencyYoung SexRatio
## 0.28 0.58 0.67 0.67 0.60
## FeEcPart MaEcPart FemaleMort MaleMort Obesity
## 0.46 0.55 0.65 0.69 0.80
## HealthExp GFC Credit PrisonPop Homocide
## 0.57 0.15 0.66 0.30 0.30
## NetMigration Tourists Internet Fertility
## 0.30 0.50 0.82 0.89
The overall MSA of 0.59 falls in Kaiser’s “miserable” range, which is not good enough. I am going to deal with that by eliminating all low-contribution variables.
mydat <- mydata[, KMO(mydata)$MSAi>0.50] # Get rid of all variables with MSA < 0.50
mydata <- mydat
Let’s take a look at the KMO value now that those problematic variables have been removed.
round( KMO(mydata)$MSA, 2 )
## [1] 0.89
That is much better. These data look more suitable for running a factor model.
Please keep in mind that I do not fully condone this approach. I am just doing it out of expediency at the moment and for the sake of having something to demonstrate output below.
Bartlett’s test compares the correlation matrix to an identity matrix (a matrix with ones on the diagonal and zeroes everywhere else, which is what the correlation matrix would look like if the variables were completely uncorrelated). The null hypothesis for Bartlett’s test is:
\[ H_0: \mathbf{R} = \mathbf{I} \] where \(\mathbf{R}\) is the correlation matrix and \(\mathbf{I}\) is the identity matrix. That is to say, Bartlett’s test lets you rule out that the variables in the data set are essentially uncorrelated. If you fail to reject the null for this test, then you may as well stop. There is nothing in there for you to factor; the variables are all essentially unrelated to one another.
library(psych)
cortest.bartlett(mydata)
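If you only need the test statistic and p-value for your write-up, you can pull them out of the returned list (a sketch):
bart <- cortest.bartlett(mydata) # when given raw data, the correlation matrix is computed for you
bart$chisq; bart$df; bart$p.value # chi-square statistic, degrees of freedom, and p-value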
This brings us through steps 2 to 5.
Theoretically, you could have as many factors as you do variables. But that, of course, defeats the purpose of running a factor analysis.
Rather, we are seeking latent structure within a set of data. Therefore, we will only be interested in factors that explain a substantial proportion of variation within the data.
We have a few ways of doing that. But, they generally all have something to do with the scree plot like the one that we develop below.
# install.packages("nFactors", dependencies = TRUE)
library(nFactors)
## Loading required package: lattice
##
## Attaching package: 'nFactors'
## The following object is masked from 'package:lattice':
##
## parallel
ev <- eigen(cor(mydata)) # get eigenvalues
ev$values
## [1] 15.89088158 2.09852113 1.82657814 1.13201882 1.02702103 0.89445313
## [7] 0.65379609 0.58186191 0.52961144 0.42492476 0.40935656 0.36205490
## [13] 0.35611955 0.28487848 0.26191759 0.22357035 0.20112354 0.17531559
## [19] 0.15282141 0.13482767 0.09161749 0.08034115 0.06528069 0.04150550
## [25] 0.03778802 0.02947327 0.01750228 0.01483793
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(ev$values, ap$eigen$qevpea)
plotnScree(nS)
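To read off the number of factors suggested by each retention rule (optimal coordinates, acceleration factor, parallel analysis, and the Kaiser rule) instead of judging the plot by eye, you can inspect the nScree object directly (a sketch):
nS$Components # number of factors recommended by each rule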
Ten factors in a data set this small do not bode well. But, let’s take a look at where this leads us in any case.
Now that you have at least some idea of how many factors to extract, you can get started. But, you still have another important decision to make. You need to consider how you plan to rotate the factors.
Think of factor rotation as being similar to focusing a telescope. You are essentially searching for a clearer association between individual factors and the various variables. The way you do this will depend on your assumption about whether the factors should be correlated with one another.
Under ideal circumstances, we would hope that the various factors are independent and, therefore, uncorrelated. But that is not always realistic. For that reason, some analysts suggest starting out with the assumption that the factors are non-independent by using an oblique rotation method. Oblique rotations will provide a correlation matrix that describes the relationships between the factors. Use that correlation matrix to evaluate whether the correlations are worth keeping (e.g., |r| > 0.30). Alternatively, if the correlations are fairly low (e.g., |r| < 0.30), then you could be justified in switching to the assumption that the factors are independent (orthogonal) and should not be correlated.
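The commands below assume a fitted factor model stored in an object called fit. Here is a sketch of one way to produce such an object with the psych package; the choice of four factors and the promax (oblique) rotation are assumptions based on the scree plot and the discussion above, not a prescription:
library(psych)
fit <- fa(mydata, nfactors = 4, rotate = "promax", fm = "ml") # ML factor analysis with an oblique rotation (assumed settings)
print(fit$loadings, cutoff = 0.3) # suppress small loadings for readability
round(fit$Phi, 2) # factor correlation matrix from the oblique rotation
With an oblique rotation, fit$Phi is the factor correlation matrix described above; if all of its off-diagonal values are below |0.30|, that is your cue to consider an orthogonal rotation instead.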
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7)
You can also visualize the factor model to ease your interpretation.
library(psych)
loads <- fit$loadings
fa.diagram(loads)
To simplify the interpretation further, cut down the number of factors you are considering.
loads2 <- fit$loadings[,1:2]
fa.diagram(loads2)
loads3 <- fit$loadings[,3:4]
fa.diagram(loads3)
From the work above, you will need to consider whether the rotation you selected was appropriate.
Do the factors look stable and distinct?
How good are your factors?
First, specify the variables in your factor.
f1 <- mydata[ , c("ExSchool", "MatMort", "AdBR", "PrePrimary",
"Secondary", "Tertiary", "PupTeaRat",
"EmpPopRat", "BirthReg", "Pension", "MedianAge",
"DependencyYoung", "MaEcPart", "Obesity", "Fertility")]
f2 <- mydata[ , c("Fertility", "life", "FemaleMort", "MaleMort")]
f3 <- mydata[ , c("GDPPC", "CO2", "Internet", "HealthExp")]
f4 <- mydata[ , c("ImmDTP", "ImmMeasles")]
Next, run Cronbach’s alpha to evaluate the internal consistency of each factor. You are looking for \(\alpha>0.80\).
library(psych)
alpha(f1)$total[1]
alpha(f2)$total[1]
alpha(f3)$total[1]
alpha(f4)$total[1]
## raw_alpha
## 0.7752715
## raw_alpha
## 0.6871495
## raw_alpha
## 0.006105121
## raw_alpha
## 0.8437348
Okay. Factors two and three fall short of the 0.80 benchmark (factor three badly so), and factor one only just misses it. We will have to go back and reevaluate what we have done.
Once we have our final, refined factor representation, it is time to write up the results. Be sure to include all details on how the analysis was run (number of factors, how they were chosen, type of rotation, etc.). Also include the factor scores and the correlation matrix, if you used an oblique rotation. Last, be sure to interpret factors and discuss implications.
So that’s it.
Easy, right?
Good luck!