Introduction

This document is one of a series designed to illustrate how the R statistical computing environment can be used to conduct various types of social work research. In this report, we present an example of using R to conduct an exploratory factor analysis (EFA) of a set of items designed to measure the perceptions that middle and high school students have about their school. The items evolved over a series of middle and high school needs assessment projects, and the EFA was designed to assess the extent to which they measured a construct we call school climate.

In this example we use the psych package to fit and assess an EFA. The psych package includes an extensive set of tools and procedures designed to support and facilitate various measurement assessments - it is a veritable Swiss Army knife for psychometric studies. The package gives applied researchers, teachers, and statisticians access to free, open-source, and commercial-quality methods for various measurement-related statistical tasks and procedures. It rivals many proprietary statistical packages (SPSS, Stata, MPlus, EQS) in terms of features and capabilities.

The EFA Process

The EFA process is composed of the following steps and decisions:

  1. Determine the appropriate correlations to be used in the analysis
  2. Assess the factorability of the correlation matrix
  3. Determine the number of factors to extract
  4. Specify a factor model and conduct the factor analysis
  5. Assess and interpret results

The Scale

The items in the scale are:

  1. My school is fair
  2. My school helps all students
  3. My school is safe
  4. My school encourages me to be involved in activities
  5. My school cares about students
  6. My school has high expectations for students
  7. My school is friendly
  8. My school is supportive
  9. My school is improving
  10. My school is a positive in my life
  11. My school is an exciting place
  12. My school is well regarded in the community
  13. My school cares about families
  14. My school welcomes parents
  15. My school values parents’ ideas

Items were measured on a Likert scale with the response categories Strongly disagree, Disagree, Can’t decide, Agree, and Strongly agree (using a 1-5 numeric coding scheme).

The data used in the study came from 3,210 seventh-grade students in seventeen school districts in a large Midwestern urban county. As mentioned above, these data were collected as part of a large-scale needs assessment designed to measure various student risk and protective factors. It is important to note that the items included in the scale were carefully vetted by school administrators, teachers, parents, and students, and all were considered to be important aspects of school climate. The goal of the development process was to produce a relatively short scale that could be used in needs assessments and evaluation studies.

Preliminaries

First, we load the psych package plus a few other helpful packages as follows:

## Load libraries
library(psych)     ## factor analysis and other psychometric tools
library(corrplot)  ## correlation matrix visualization
library(polycor)   ## polychoric and polyserial correlations

We load the data file and examine the first few records:
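
A minimal sketch of the loading step, assuming the responses are stored in a comma-delimited file (the file name school_climate.csv is hypothetical):

## Read the item responses into a data frame
data <- read.csv("school_climate.csv")
## Examine the first few records
head(data)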

##   c.1 c.2 c.3 c.4 c.5 c.6 c.7 c.8 c.9 c.10 c.11 c.12 c.13 c.14 c.15
## 1   5   5   5   5   5   5   5   5   5    5    4    5    5    5    5
## 2   4   3   4   4   5   4   3   3   4    3    4    4    2    4    3
## 3   3   4   4   4   4   4   4   4   4    4    4    4    4    4    4
## 4   4   3   4   4   5   5   5   5   5    5    1    3    4    4    5
## 5   1   4   1   1   4   5   4   4   4    1    4    1    4    4    1
## 6   4   2   3   4   4   4   3   4   4    4    2    4    3    4    3

Next, we examine basic descriptive statistics. This information can be quite helpful in determining whether there are missing cases and whether there are data irregularities to address, such as high degrees of skewness or kurtosis. We see that there are no missing cases, nor do there appear to be problems with distribution shape.

describe(data)
##      vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
## c.1     1 3210 3.39 1.13      4    3.47 1.48   1   5     4 -0.61    -0.39 0.02
## c.2     2 3210 3.53 1.15      4    3.62 1.48   1   5     4 -0.63    -0.37 0.02
## c.3     3 3210 3.74 1.08      4    3.88 1.48   1   5     4 -0.88     0.31 0.02
## c.4     4 3210 3.68 1.10      4    3.80 1.48   1   5     4 -0.80     0.07 0.02
## c.5     5 3210 3.79 1.09      4    3.94 1.48   1   5     4 -0.96     0.42 0.02
## c.6     6 3210 3.92 1.02      4    4.06 1.48   1   5     4 -1.00     0.74 0.02
## c.7     7 3210 3.57 1.12      4    3.67 1.48   1   5     4 -0.68    -0.15 0.02
## c.8     8 3210 3.72 1.05      4    3.84 1.48   1   5     4 -0.89     0.44 0.02
## c.9     9 3210 3.57 1.14      4    3.68 1.48   1   5     4 -0.65    -0.26 0.02
## c.10   10 3210 3.51 1.17      4    3.62 1.48   1   5     4 -0.62    -0.34 0.02
## c.11   11 3210 3.18 1.23      3    3.23 1.48   1   5     4 -0.26    -0.87 0.02
## c.12   12 3210 3.55 1.09      4    3.65 1.48   1   5     4 -0.64    -0.02 0.02
## c.13   13 3210 3.48 1.11      4    3.56 1.48   1   5     4 -0.54    -0.28 0.02
## c.14   14 3210 3.82 0.99      4    3.94 1.48   1   5     4 -0.98     0.93 0.02
## c.15   15 3210 3.34 1.13      3    3.41 1.48   1   5     4 -0.40    -0.41 0.02

The EFA Process

Step 1: Compute the Appropriate Correlations

Because our measures are ordinal, we compute a matrix of polychoric correlations that we will then use in the analyses. We examine the correlations in a table:
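
The call that builds the matrix is not shown; a sketch using psych's polychoric() function, with the object name poly_values taken from the KMO call displayed later:

## Compute polychoric correlations for the 15 ordinal items
poly_values <- polychoric(data)
## Display the correlation matrix, rounded to two decimal places
round(poly_values$rho, 2)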

##       c.1  c.2  c.3  c.4  c.5  c.6  c.7  c.8  c.9 c.10 c.11 c.12 c.13 c.14 c.15
## c.1  1.00 0.73 0.65 0.58 0.68 0.55 0.67 0.68 0.62 0.64 0.56 0.62 0.59 0.54 0.54
## c.2  0.73 1.00 0.69 0.62 0.76 0.61 0.67 0.73 0.65 0.65 0.54 0.62 0.64 0.59 0.57
## c.3  0.65 0.69 1.00 0.58 0.68 0.58 0.69 0.70 0.62 0.63 0.51 0.65 0.59 0.56 0.52
## c.4  0.58 0.62 0.58 1.00 0.69 0.61 0.60 0.67 0.59 0.60 0.52 0.59 0.62 0.59 0.53
## c.5  0.68 0.76 0.68 0.69 1.00 0.70 0.72 0.80 0.69 0.67 0.56 0.66 0.72 0.66 0.62
## c.6  0.55 0.61 0.58 0.61 0.70 1.00 0.60 0.67 0.60 0.56 0.45 0.61 0.58 0.58 0.52
## c.7  0.67 0.67 0.69 0.60 0.72 0.60 1.00 0.82 0.69 0.71 0.59 0.67 0.65 0.60 0.56
## c.8  0.68 0.73 0.70 0.67 0.80 0.67 0.82 1.00 0.76 0.73 0.59 0.70 0.70 0.66 0.64
## c.9  0.62 0.65 0.62 0.59 0.69 0.60 0.69 0.76 1.00 0.69 0.59 0.64 0.64 0.60 0.61
## c.10 0.64 0.65 0.63 0.60 0.67 0.56 0.71 0.73 0.69 1.00 0.70 0.67 0.66 0.60 0.59
## c.11 0.56 0.54 0.51 0.52 0.56 0.45 0.59 0.59 0.59 0.70 1.00 0.58 0.59 0.53 0.54
## c.12 0.62 0.62 0.65 0.59 0.66 0.61 0.67 0.70 0.64 0.67 0.58 1.00 0.70 0.63 0.60
## c.13 0.59 0.64 0.59 0.62 0.72 0.58 0.65 0.70 0.64 0.66 0.59 0.70 1.00 0.75 0.73
## c.14 0.54 0.59 0.56 0.59 0.66 0.58 0.60 0.66 0.60 0.60 0.53 0.63 0.75 1.00 0.73
## c.15 0.54 0.57 0.52 0.53 0.62 0.52 0.56 0.64 0.61 0.59 0.54 0.60 0.73 0.73 1.00

We see in the table that only one pairwise correlation (between c.6 and c.11) falls below .50. We consider the correlations to be large.

Step 2: Assess Factorability

Next, we compute the Kaiser-Meyer-Olkin (KMO) test of sampling adequacy to assess the factorability of our correlation matrix. The KMO test examines whether the partial correlations in the data are close enough to zero to suggest that there is at least one latent factor underlying our items. KMO values have a possible range of 0 - 1.00, with higher values reflecting higher levels of factorability. Our overall MSA (measure of sampling adequacy) = .97 is excellent, as are the MSA coefficients for each item. We continue with our factoring.
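
The corresponding call, taken from the Call line of the output below:

KMO(r = poly_values$rho)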

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = poly_values$rho)
## Overall MSA =  0.97
## MSA for each item = 
##  c.1  c.2  c.3  c.4  c.5  c.6  c.7  c.8  c.9 c.10 c.11 c.12 c.13 c.14 c.15 
## 0.97 0.97 0.98 0.98 0.97 0.98 0.96 0.96 0.98 0.97 0.96 0.98 0.96 0.96 0.96

Step 3: Determine the Number of Factors

To determine the number of factors, it is useful to examine eigenvalues and to display them in a scree plot. Eigenvalues are measures of the variance accounted for by a factor. An examination of eigenvalues can provide insights into how many factors underlie a set of correlations. We display the first five eigenvalues below. There is a substantial difference between the first eigenvalue (9.865) and the second (0.774), suggesting that a one-factor solution should be explored.
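
The extraction call is not shown; a sketch of one way to obtain the eigenvalues, using base R's eigen() on the polychoric correlation matrix:

## Eigenvalues of the polychoric correlation matrix
eigen_values <- eigen(poly_values$rho)$values
## Display the first five, rounded to two decimal places
round(eigen_values[1:5], 2)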

## [1] 9.87 0.77 0.66 0.48 0.44

The scree plot visually presents the eigenvalue data. The large difference between the first eigenvalue and all others is dramatic. The usual interpretation of a scree plot is to look for an ‘elbow’, or a large drop-off in values, that provides information about the number of potential factors. Our scree plot affirms what we already know from the eigenvalues - there is one dominant value, indicating that one factor is plausible.
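
The plot itself is not reproduced here; a sketch of one way to generate it, using psych's scree() function:

## Scree plot of the eigenvalues of the polychoric correlation matrix
scree(poly_values$rho)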

Step 4: Conduct the Factor Analysis

We now proceed with the actual factor analysis. One of the first decisions in this step is which factor model to use. In general, we can consider using a principal components model or a common factor model. (Note: The literature on these models is extensive, and we leave it to the reader to explore the differences.) Since our goal is to examine the relationship between our items and the school climate latent variable, we choose a common factor model. More specifically, we decide to fit a one-factor model using the principal axis factoring method.
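
A sketch of the fitting call using psych's fa() function; the object name efa_fit is ours, and fm = "pa" requests principal axis factoring:

## Fit a one-factor principal axis model to the polychoric correlations
efa_fit <- fa(poly_values$rho, nfactors = 1, fm = "pa", n.obs = 3210)
## Display the standardized factor loadings
print(efa_fit$loadings)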

The standardized factor loadings from our principal axis factoring output are shown below. The loadings (PA1) are partial regression coefficients that express the strength of the relationship between an item and the latent trait. While there is some variability in recommended interpretation thresholds, a value of .37 or larger has been suggested as a substantive or practically significant cut-off. Given that reference value, all of the loadings would be considered quite large. They range from .700 (c.11 - My school is an exciting place) to .894 (c.8 - My school is supportive). In addition, the Proportion Var = .63 indicates that 63% of the total item variance is accounted for by the factor. While there is no suggested standard for interpretation, it is a ‘higher is better’ measure. We consider our value to be substantive.

## 
## Loadings:
##      PA1 
## c.1  0.78
## c.2  0.82
## c.3  0.78
## c.4  0.75
## c.5  0.87
## c.6  0.74
## c.7  0.83
## c.8  0.89
## c.9  0.81
## c.10 0.82
## c.11 0.70
## c.12 0.80
## c.13 0.82
## c.14 0.77
## c.15 0.74
## 
##                 PA1
## SS loadings    9.51
## Proportion Var 0.63

In addition, we create a plot of the item-latent trait relationships, as shown below. The direction of the arrows leading from the latent trait to each variable underscores the assumption that the latent trait influences (gives rise to) item responses. Again, the standardized factor loadings attached to the arrows indicate the strength of each item-latent trait relationship.
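
A sketch of the call that produces the diagram, assuming the efa_fit object from the sketch above:

## Draw the latent trait with an arrow to each item, labeled with its loading
fa.diagram(efa_fit)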

Another coefficient important to consider in model interpretation is an item’s communality. The term communality derives from the fact that it is a measure of the proportion of the variance in an item that is accounted for by the common factors in the model. For a one-factor solution, a communality is simply the square of a standardized factor loading - it is, in effect, an R-square and is thus a measure of the item variance explained by the latent trait. Because our standardized loadings were large, the communalities were also substantive.
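
Assuming the efa_fit object from the earlier sketch, the communalities can be extracted directly from the fitted model:

## Item communalities (squared loadings in a one-factor solution)
round(efa_fit$communality, 2)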

##  c.1  c.2  c.3  c.4  c.5  c.6  c.7  c.8  c.9 c.10 c.11 c.12 c.13 c.14 c.15 
## 0.60 0.67 0.60 0.56 0.75 0.54 0.70 0.80 0.66 0.67 0.49 0.65 0.68 0.59 0.55

Finally, if the items in a scale show evidence of being unidimensional (having one factor), psych has a function that computes an index of unidimensionality and other useful summary indexes. Output from that function is shown below. The u = .99 index indicates a high degree of unidimensionality (possible values range from 0 to 1.0, with higher scores indicating higher levels of unidimensionality). In addition, coefficient alpha = .96, which suggests the items have a high degree of internal consistency reliability.
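
The corresponding call, taken from the Call line of the output below:

unidim(x = data, cor = "poly")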

## 
## A measure of unidimensionality 
##  Call: unidim(x = data, cor = "poly")
## 
## Unidimensionality index = 
##        u av.r.fit   fa.fit    alpha     av.r median.r Unidim.A 
##     0.99     0.99     1.00     0.96     0.63     0.62     1.00 
## 
## unidim adjusted index reverses negatively scored items.
## alpha    Based upon reverse scoring some items.
## average and median  correlations are based upon reversed scored items

Step 5: Assess and Interpret Results

Finally, we assess the results of our EFA following the steps in the process:

  1. Determine the appropriate correlations to be used in the analysis
    • Our Likert response scales require the use of ordinal procedures - all models were based on polychoric correlations
  2. Assess the factorability of the correlation matrix
    • The Kaiser-Meyer-Olkin test of sampling adequacy (overall MSA = .97) plus high individual item MSA values indicated that our correlation matrix was factorable
  3. Determine the number of factors to extract
    • Eigenvalues and the scree test indicated that a one-factor model was plausible
  4. Specify a factor model and conduct the factor analysis
    • We elected to use a common factor model estimated with the principal axis factoring method, based on polychoric correlations
  5. Assess and interpret results
    • Results from our principal axis factor model indicated that each item had a substantive relationship with the school climate latent variable; overall, the factor accounted for 63% of the total item variance
    • The unidimensionality index (u = .99) indicated that the items had a high degree of unidimensionality

This large-sample EFA lends support to the viability of the scale as a measure of school climate. While the items vary in their relationship with the school climate latent variable, as a whole they work well together. It is important to note that the steps and decisions we followed in this example would also apply to the analysis of items where two or more latent variables influence responses. In that situation, we would have to make one additional decision about the rotation of the factor solution. We will cover the topic of rotation in a future monograph.

Additional Thoughts

Exploratory factor analysis is a venerable set of frameworks and methods that has evolved over decades. The EFA literature is expansive and growing. We recommend that EFA methods become a staple in the social work measurement research toolkit, alongside confirmatory factor analysis (CFA) and item response theory (IRT) approaches. Two companion Using R in Social Work Research monographs provide useful information about using CFA and IRT.

Taken together, these resources present a fairly complete set of illustrations of how the R statistical computing environment and various R packages can be employed in our social work measurement work.