Where to start

Conventional factor analysis requires that you have continuous data. But, to do a meaningful job of factor analysis, you will also want variables that are generally confined to a common theme or conceptual space. You shouldn’t expect that chucking a bunch of unrelated information into a data set will necessarily reveal some hidden secrets to life, the universe and everything. There are much better means of doing that.

If you are interested in a workflow for preparing your data set for a factor analysis, consider this.

  1. Remove non-continuous & off-theme variables
  2. Remove missings
  3. Evaluate correlation matrix
  4. KMO Measures
  5. Bartlett’s test

Let’s take each of these in turn.




Removing non-continuous & off-theme variables

This part of the process could be better summed up as “know your data.”

Conventional factor analysis requires that we have continuous data. So, your first pass through the data may take the form of just looking at the variable names and deciding whether some are obviously not going to be useful. The objective is to create a subset of your original data, so that you may use the subset to run the factor analysis.

It is, however, more likely that you will find the information that you are looking for by evaluating summary measures of the data.

# Get variable names
names(mydata)

# Consider summary measures
summary(mydata)
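
A quick way to spot non-numeric columns, for example, is to list each variable’s class.

# List the class of each variable to spot non-numeric columns
sapply(mydata, class)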

To remove the variables that are not usable in the factor analysis, subset it in a manner with which you are comfortable. Here, we just use base R to remove the first three variables, as well as the seventh variable.

mydata <- mydata[ , -c(1:3, 7)]

Although continuous variables will be fairly easy to spot, others, such as Likert-type variables, may require your attention. Scaled response variables may be converted to numeric values. There are multiple means of accomplishing this. I have provided just one such method, below.

In the following code example, I was working with a set of opinion questions that were scaled as: “Very Dissatisfied”, “Dissatisfied”, “Neutral”, “Satisfied”, “Very Satisfied”, and “Don’t Know”. Missing values in this data set were indicated by empty cells, which R will read as "".

The code was written to cover the entire data set. If you wish to use this with just one variable, or a set of them, you will have to modify this to include only the variables that you wish to recode.

To use ifelse(), you first supply a test (here, whether a cell matches a particular response), then the value to return when that test is TRUE, and finally the value to return otherwise. That last argument can itself be another ifelse() statement. That is what we have here.

One word of warning is that recoding variables may result in catastrophic mistakes. So, it is a very good idea to create a temporary version of the data to modify. That way, you can easily roll back to the original data and start over.

Once you are happy with the result, you can rename the temporary version to the name you wish to use for your data. Here, that name is “mydata”.

# Create a temporary version of the data set
Temp <- mydata

# Convert Likert-type scale to numeric
Temp <- ifelse(Temp=="", NA,     # Empty cells treated as missing
             ifelse(Temp=="Very Dissatisfied",-2,
             ifelse(Temp=="Dissatisfied", -1,
             ifelse(Temp=="Neutral", 0,
             ifelse(Temp=="Don't Know", 0,  # Don't know treated as noncommittal (neutral)
             ifelse(Temp=="Satisfied", 1,
             ifelse(Temp=="Very Satisfied", 2, NA)))))))

# Convert the data to a data frame
mydata <- as.data.frame(Temp)

The next thing to consider is whether all the variables really belong in the analysis. They should all fall into the theme that you wish to explore. For this task, you will want to read through the variable names, if those are meaningful, or spend time going through the data definitions.

Develop a list of any variables that you feel are off-topic or just extraneous to what you are hoping to explore. Once you have completed the process, remove the variables as you did above.
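
As a rough sketch, suppose your data set contained an ID column and a free-text comment field named id and comments (hypothetical names here); removing them by name might look like this.

# Remove off-theme variables by name (id and comments are hypothetical)
mydata <- mydata[ , !(names(mydata) %in% c("id", "comments"))]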




Remove missing observations

This step is a little tricky: it is usually better to remove missing observations only after all unneeded variables have been removed in the steps that follow this one.

That is because removing missings drops any observation that is missing data for even just one variable. It is a fairly radical action. So, it is a good idea to remove all the variables that will not work in the analysis first and only then remove observations with missing data, at the end of the data preparation process.
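
Before committing either way, it can help to see how much data you would actually lose. A quick sketch, assuming your data frame is still called mydata:

# How many observations would na.omit() drop?
nrow(mydata) - nrow(na.omit(mydata))

# Missing values per variable
colSums(is.na(mydata))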

Although the steps that follow may well flag problematic variables that have an inordinate number of missing observations, some of the methods below do not play well with missing data. An alternative to simply stripping out missing observations early in the process is to remove them virtually, just for the purpose of running a particular function.

If you have a plethora of data and just wish to remove all missing data at the outset, use the following code to overwrite your data object without the observations that contain missing data.

mydata <- na.omit(mydata)

If, however, you prefer to run the other functions without permanently deleting observations with missing data, then you may use na.omit(mydata) in place of mydata in those functions. The result will look a little clunkier, but it allows you to be more flexible and possibly retain more of the observations than you would otherwise.

Using correlation as an example, first consider the correlation function and next the same function with the modified code. (Note: the cor() function has a built-in argument for dealing with missings. But, using this method will provide a little consistency and a reminder that you will have to eventually remove missings.)

# Correlation function
cor(mydata)

# Correlation function, utilizing maximal observations
cor(mydata, use="pairwise.complete.obs")

# Same function, but with a reminder of the missings
cor(na.omit(mydata))




Evaluate the Correlation Matrix

For a factor analysis to run successfully, there should be some relationships present between the variables. We are concerned, however, when those relationships are too strong (r > 0.90 or so). Factors should reflect the unique contribution of each variable; when two variables are highly correlated, we have redundancy that is potentially unhelpful to the analysis.

You may use any of the three options above to check the correlations within your data. You are likely to find, though, that even a moderately sized correlation matrix can quickly overwhelm the R console and be functionally illegible. A simple workaround is to save the correlation matrix to a spreadsheet and view it there.

write.csv(cor(mydata), file="Correlation_Values.csv") 

Remember to add na.omit or other appropriate arguments to deal with missing data, if applicable.
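
If you would rather have R flag the strong relationships for you, the following sketch (one possible approach, not the only one) lists the variable pairs with an absolute correlation above 0.90.

# Flag variable pairs with |r| > 0.90
cors <- cor(mydata, use="pairwise.complete.obs")
high <- which(abs(cors) > 0.90 & upper.tri(cors), arr.ind=TRUE)
data.frame(var1=rownames(cors)[high[,1]],
           var2=colnames(cors)[high[,2]],
           r=cors[high])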




KMO Measures

The Kaiser-Meyer-Olkin (KMO) statistic measures the “sampling adequacy” of the data set as a whole, and provides measures for the individual variables as well. Kaiser presented his breakdown for the measure as follows:

  • 0.00 to 0.49 unacceptable
  • 0.50 to 0.59 miserable
  • 0.60 to 0.69 mediocre
  • 0.70 to 0.79 middling
  • 0.80 to 0.89 meritorious
  • 0.90 to 1.00 marvelous

Your takeaway from this should be that you need, at minimum, a KMO measure of 0.50 or greater to have any hope of deriving a useful factor model from your data. The psych package offers an excellent version of the KMO test.

library(psych)

KMO(mydata)

If your data, as a whole, produce a KMO measure of less than 0.50, you have two options: give up, or use the information in the KMO output to remove particularly problematic variables from the data. If you do remove variables, re-run KMO to check whether the data are more usable at that point.
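
As one possible sketch, KMO() from the psych package reports the per-variable measures in the MSAi element of its output, so you could drop the variables that fall below 0.50 and re-run the test.

# Identify variables with individual MSA below 0.50 and remove them
kmo_res <- KMO(na.omit(mydata))
low_msa <- names(which(kmo_res$MSAi < 0.50))
mydata  <- mydata[ , !(names(mydata) %in% low_msa)]

# Re-run KMO on the reduced data
KMO(na.omit(mydata))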




Bartlett’s Test for Sphericity

Bartlett’s test compares the correlation matrix we have with a matrix that mimics entirely uncorrelated variables (i.e., an “identity matrix”). It tests whether the relationships present in the data are meaningfully different from those among uncorrelated variables. In other words, are the relationships present in the data sufficient for running a factor analysis?

The null hypothesis is that there is no difference between the correlation matrix and its identity matrix. If we fail to reject the null, then running a factor analysis would be a waste of our time and effort.

Here, we run Bartlett’s test using the psych package.

library(psych)
cortest.bartlett(mydata)
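
If you pass the raw data, cortest.bartlett() will derive the correlation matrix itself. You can also pass a correlation matrix directly, in which case you supply the sample size yourself (shown here as a sketch, using the complete cases as a conservative n).

# Alternative: supply a correlation matrix and the sample size
R <- cor(mydata, use="pairwise.complete.obs")
cortest.bartlett(R, n=nrow(na.omit(mydata)))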

If you reach this point and have not already done so, it is now time to remove missing observations from the data.

mydata <- na.omit(mydata)

That is about it.

It is possible to run a factor analysis without first running through some or all of these steps. But these are meant to smooth the way for the analysis to go well.

Good luck in your work!




Just the scripts…

# Remove any observation IDs or other unneeded variables
mydata <- mydata[,-c(1:3)]

## Evaluate the data set
summary(mydata)

# Remove missing data
# mydata <- na.omit(mydata) # Or: hold off until the end

## Correlations --> Avoid any high (>0.90) correlations
## If you want to run this before eliminating missing values use: 
# cor(mydata, use="pairwise.complete.obs") 

write.csv(cor(na.omit(mydata)), file="Correlation_Values.csv") # Write correlations to csv

# I chose to remove variables number 3 & 34
mydata <- mydata[ , -c(3,34)]

### KMO Test  
#   Kaiser provided the following values for interpreting the results:  
#   
#     0.00 to 0.49 unacceptable
#     0.50 to 0.59 miserable
#     0.60 to 0.69 mediocre
#     0.70 to 0.79 middling
#     0.80 to 0.89 meritorious
#     0.90 to 1.00 marvelous

library(psych)

KMO(mydata)


### Bartlett's test for sphericity
# Null: Correlation Matrix is an identity matrix
library(psych)
cortest.bartlett(mydata)


## Once all that is done and I have removed all the 
##   variables I need to remove: 

# Remove missings

mydata <- na.omit(mydata)