This document is one of a series designed to illustrate how the R statistical computing environment can be used to conduct various types of social work research. In this report, we present an example of using R to conduct an exploratory factor analysis (EFA) of a set of items designed to measure the perceptions middle and high school students had of their school. The items evolved over a series of middle and high school needs assessment projects, and the EFA was designed to assess the extent to which they measured a construct we call school climate.
In this example we use the psych package to fit and assess an EFA. The psych package includes an extensive set of tools and procedures designed to support and facilitate various measurement assessments - it is a veritable Swiss Army knife for psychometric studies. The package gives applied researchers, teachers, and statisticians access to free, open-source, commercial-quality methods for a wide range of measurement-related statistical tasks and procedures, and it rivals many proprietary statistical packages (SPSS, Stata, Mplus, EQS) in features and capacity.
The EFA process is composed of the following steps and decisions:
The items in the scale are:
Items were measured on a Likert scale with the response categories Strongly disagree, Disagree, Can’t decide, Agree, and Strongly agree (using a 1-5 numeric coding scheme).
The data used in the study came from 3,210 seventh-grade students in seventeen school districts in a large Midwestern urban county. As mentioned above, these data were collected as part of a large-scale needs assessment designed to measure various student risk and protective factors. It is important to note that the items included in the scale were carefully vetted by school administrators, teachers, parents, and students, and all were considered to be important aspects of school climate. The goal of the development process was to produce a relatively short scale that could be used in needs assessments and evaluation studies.
First, we load the psych package plus a few other helpful packages as follows:
## Load Libraries
library(psych)    # psychometric tools: describe(), KMO(), fa(), fa.diagram(), unidim()
library(corrplot) # visualization of correlation matrices
library(polycor)  # polychoric correlation utilities
We load the data file and examine the first few records:
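The import step itself is not shown in this extract; a minimal sketch, assuming the responses are stored in a CSV file (the file name school_climate.csv is hypothetical):
data <- read.csv("school_climate.csv") # hypothetical file name; adjust to your data source
head(data)                             # examine the first few records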
## c.1 c.2 c.3 c.4 c.5 c.6 c.7 c.8 c.9 c.10 c.11 c.12 c.13 c.14 c.15
## 1 5 5 5 5 5 5 5 5 5 5 4 5 5 5 5
## 2 4 3 4 4 5 4 3 3 4 3 4 4 2 4 3
## 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## 4 4 3 4 4 5 5 5 5 5 5 1 3 4 4 5
## 5 1 4 1 1 4 5 4 4 4 1 4 1 4 4 1
## 6 4 2 3 4 4 4 3 4 4 4 2 4 3 4 3
Next, we examine basic descriptive statistics. This information can be quite helpful in determining whether there are missing cases and whether there are data irregularities to address, such as high degrees of skewness and/or kurtosis. We see that there are no missing cases, nor do there appear to be problems with distribution shape.
describe(data)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## c.1 1 3210 3.39 1.13 4 3.47 1.48 1 5 4 -0.61 -0.39 0.02
## c.2 2 3210 3.53 1.15 4 3.62 1.48 1 5 4 -0.63 -0.37 0.02
## c.3 3 3210 3.74 1.08 4 3.88 1.48 1 5 4 -0.88 0.31 0.02
## c.4 4 3210 3.68 1.10 4 3.80 1.48 1 5 4 -0.80 0.07 0.02
## c.5 5 3210 3.79 1.09 4 3.94 1.48 1 5 4 -0.96 0.42 0.02
## c.6 6 3210 3.92 1.02 4 4.06 1.48 1 5 4 -1.00 0.74 0.02
## c.7 7 3210 3.57 1.12 4 3.67 1.48 1 5 4 -0.68 -0.15 0.02
## c.8 8 3210 3.72 1.05 4 3.84 1.48 1 5 4 -0.89 0.44 0.02
## c.9 9 3210 3.57 1.14 4 3.68 1.48 1 5 4 -0.65 -0.26 0.02
## c.10 10 3210 3.51 1.17 4 3.62 1.48 1 5 4 -0.62 -0.34 0.02
## c.11 11 3210 3.18 1.23 3 3.23 1.48 1 5 4 -0.26 -0.87 0.02
## c.12 12 3210 3.55 1.09 4 3.65 1.48 1 5 4 -0.64 -0.02 0.02
## c.13 13 3210 3.48 1.11 4 3.56 1.48 1 5 4 -0.54 -0.28 0.02
## c.14 14 3210 3.82 0.99 4 3.94 1.48 1 5 4 -0.98 0.93 0.02
## c.15 15 3210 3.34 1.13 3 3.41 1.48 1 5 4 -0.40 -0.41 0.02
Because our measures are ordinal, we compute a matrix of polychoric correlations that we will then use in the analyses. We examine the correlations in a table:
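A sketch of that computation using psych's polychoric() function, which returns a list whose rho element holds the correlation matrix (the object name poly_values matches the call echoed in the KMO output further below):
poly_values <- polychoric(data) # polychoric correlations for the ordinal items
round(poly_values$rho, 2)       # display the matrix rounded to two decimals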
## c.1 c.2 c.3 c.4 c.5 c.6 c.7 c.8 c.9 c.10 c.11 c.12 c.13 c.14 c.15
## c.1 1.00 0.73 0.65 0.58 0.68 0.55 0.67 0.68 0.62 0.64 0.56 0.62 0.59 0.54 0.54
## c.2 0.73 1.00 0.69 0.62 0.76 0.61 0.67 0.73 0.65 0.65 0.54 0.62 0.64 0.59 0.57
## c.3 0.65 0.69 1.00 0.58 0.68 0.58 0.69 0.70 0.62 0.63 0.51 0.65 0.59 0.56 0.52
## c.4 0.58 0.62 0.58 1.00 0.69 0.61 0.60 0.67 0.59 0.60 0.52 0.59 0.62 0.59 0.53
## c.5 0.68 0.76 0.68 0.69 1.00 0.70 0.72 0.80 0.69 0.67 0.56 0.66 0.72 0.66 0.62
## c.6 0.55 0.61 0.58 0.61 0.70 1.00 0.60 0.67 0.60 0.56 0.45 0.61 0.58 0.58 0.52
## c.7 0.67 0.67 0.69 0.60 0.72 0.60 1.00 0.82 0.69 0.71 0.59 0.67 0.65 0.60 0.56
## c.8 0.68 0.73 0.70 0.67 0.80 0.67 0.82 1.00 0.76 0.73 0.59 0.70 0.70 0.66 0.64
## c.9 0.62 0.65 0.62 0.59 0.69 0.60 0.69 0.76 1.00 0.69 0.59 0.64 0.64 0.60 0.61
## c.10 0.64 0.65 0.63 0.60 0.67 0.56 0.71 0.73 0.69 1.00 0.70 0.67 0.66 0.60 0.59
## c.11 0.56 0.54 0.51 0.52 0.56 0.45 0.59 0.59 0.59 0.70 1.00 0.58 0.59 0.53 0.54
## c.12 0.62 0.62 0.65 0.59 0.66 0.61 0.67 0.70 0.64 0.67 0.58 1.00 0.70 0.63 0.60
## c.13 0.59 0.64 0.59 0.62 0.72 0.58 0.65 0.70 0.64 0.66 0.59 0.70 1.00 0.75 0.73
## c.14 0.54 0.59 0.56 0.59 0.66 0.58 0.60 0.66 0.60 0.60 0.53 0.63 0.75 1.00 0.73
## c.15 0.54 0.57 0.52 0.53 0.62 0.52 0.56 0.64 0.61 0.59 0.54 0.60 0.73 0.73 1.00
We see in the table that only one of the pairwise correlations (c.6 with c.11, at .45) falls below .50. We consider the correlations to be large.
Next, we compute the Kaiser-Meyer-Olkin (KMO) test of sampling adequacy to assess the factorability of our correlation matrix. The KMO test examines whether the partial correlations in the data are close enough to zero to suggest there is at least one latent factor underlying our items. KMO values range from 0 to 1.00, with higher values reflecting greater factorability. Our overall MSA (measure of sampling adequacy) of .97 is excellent, as are the MSA coefficients for each item. We continue with our factoring.
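The call, echoed in the output header below, passes the polychoric matrix directly:
KMO(r = poly_values$rho)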
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = poly_values$rho)
## Overall MSA = 0.97
## MSA for each item =
## c.1 c.2 c.3 c.4 c.5 c.6 c.7 c.8 c.9 c.10 c.11 c.12 c.13 c.14 c.15
## 0.97 0.97 0.98 0.98 0.97 0.98 0.96 0.96 0.98 0.97 0.96 0.98 0.96 0.96 0.96
To determine the number of factors, it is useful to examine eigenvalues and to display them in a scree plot. Eigenvalues are measures of the variance accounted for by a factor, and examining them can provide insight into how many factors a set of correlations contains. We display the first five eigenvalues below. There is a substantial difference between the first eigenvalue (9.865) and the second (0.774), suggesting that a one-factor solution should be explored.
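One way to obtain these values is base R's eigen() function; a sketch, assuming the eigenvalues are computed from the polychoric matrix:
eigen_values <- eigen(poly_values$rho)$values # eigenvalues of the correlation matrix
round(eigen_values[1:5], 2)                   # display the first five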
## [1] 9.87 0.77 0.66 0.48 0.44
The scree plot visually presents the eigenvalue data. The usual interpretation of a scree plot is to look for an ‘elbow’, a large drop-off in values, that signals the number of potential factors. Our scree plot affirms what we already know from the eigenvalues - the difference between the first eigenvalue and all others is dramatic, and the one dominant value indicates that a one-factor solution is plausible.
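psych's scree() function draws the plot directly from a correlation matrix; a sketch:
scree(poly_values$rho) # scree plot of the eigenvalues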
We now proceed with the actual factor analysis. One of the first decisions in this step is which factor model to use. In general, we can consider a principal components model or a common factor model. (Note: the literature on these models is extensive, and we leave it to the reader to explore the differences.) Since our goal is to examine the relationship between our items and the school climate latent variable, we choose a common factor model. More specifically, we fit the model using principal axis factoring, specifying a one-factor solution.
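A sketch of the fitting call using psych's fa() function (the object name efa_pa is ours; n.obs is the sample size reported above):
efa_pa <- fa(r = poly_values$rho, nfactors = 1, n.obs = 3210, fm = "pa") # one-factor principal axis model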
The standardized factor loadings from our principal axis factoring output are shown below. The loadings (PA1) are partial regression coefficients that express the strength of the relationship between an item and the latent trait. While recommended interpretation thresholds vary, a value of .37 or larger has been suggested as a substantive, or practically significant, cut-off. Against that reference value, all of the loadings would be considered quite large; they range from .70 (c.11) to .89 (c.8 - My school is supportive). In addition, the Proportion Var = .63 indicates that 63.0% of the item variance is accounted for by the factor. While there is no suggested standard for interpreting this value, it is a ‘higher is better’ measure, and we consider ours substantive.
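Printing the loadings element of the fitted object produces the table below:
print(efa_pa$loadings) # standardized loadings, SS loadings, and Proportion Var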
##
## Loadings:
## PA1
## c.1 0.78
## c.2 0.82
## c.3 0.78
## c.4 0.75
## c.5 0.87
## c.6 0.74
## c.7 0.83
## c.8 0.89
## c.9 0.81
## c.10 0.82
## c.11 0.70
## c.12 0.80
## c.13 0.82
## c.14 0.77
## c.15 0.74
##
## PA1
## SS loadings 9.51
## Proportion Var 0.63
In addition, we create a plot of the item-latent trait relationships, as shown below. The direction of the arrows leading from the latent trait to each variable underscores the assumption that the latent trait influences (gives rise to) item responses. Again, the standardized factor loading attached to each arrow indicates the strength of the item-latent trait relationship.
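The path diagram can be drawn with psych's fa.diagram() function; a sketch using the fitted object from above:
fa.diagram(efa_pa) # latent trait with arrows to each item, labeled with the loadings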
Another coefficient important to consider in model interpretation is an item’s communality. The term derives from the fact that a communality measures the proportion of variance in an item that is accounted for by the common factors in the model. For a one-factor solution, a communality is simply the square of the standardized factor loading - it is, in effect, an R-squared and thus a measure of the item variance explained by the latent trait. Because our standardized loadings were large, the communalities were also substantive.
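The fitted object stores these values in its communality element; rounding reproduces the output below:
round(efa_pa$communality, 2) # communality = squared standardized loading in a one-factor model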
## c.1 c.2 c.3 c.4 c.5 c.6 c.7 c.8 c.9 c.10 c.11 c.12 c.13 c.14 c.15
## 0.60 0.67 0.60 0.56 0.75 0.54 0.70 0.80 0.66 0.67 0.49 0.65 0.68 0.59 0.55
Finally, if the items in a scale show evidence of being unidimensional (having one factor), psych has a function, unidim(), that computes indexes of unidimensionality and other useful summary indexes. Output from that function is shown below. The u = .99 index indicates a high degree of unidimensionality (possible values range from 0 to 1.0, with higher scores indicating greater unidimensionality). In addition, coefficient alpha = .96, which suggests the items have a high degree of internal consistency reliability.
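The call, echoed in the output below, operates on the raw data and requests polychoric correlations:
unidim(data, cor = "poly")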
##
## A measure of unidimensionality
## Call: unidim(x = data, cor = "poly")
##
## Unidimensionality index =
##    u  fit fa.fit alpha av.r median.r Unidim.A
## 0.99 0.99   1.00  0.96 0.63     0.62     1.00
##
## unidim adjusted index reverses negatively scored items.
## alpha Based upon reverse scoring some items.
## average and median correlations are based upon reversed scored items
To close the analysis, we assess the results of our EFA following the steps in the process:
This large-sample EFA lends support to the viability of the scale as a measure of school climate. While the items vary in their relationship with the school climate latent variable, as a whole they work well together. It is important to note that the steps and decisions we followed in this example would also apply to the analysis of items where two or more latent variables influence responses. In that situation, we would have to make one additional decision about the rotation of the factor solution. We will cover the topic of rotation in a future monograph.
Exploratory factor analysis is a venerable set of frameworks and methods that has evolved over decades, and the EFA literature is expansive and growing. We recommend that EFA methods become a staple in the social work measurement research toolkit, alongside other factor analysis approaches using confirmatory factor analysis (CFA) and item response theory (IRT) methods. Two Using R in Social Work Research monographs provide useful information about using CFA and IRT:
Taken together, these resources present a fairly complete set of illustrations about how the R statistical computing environment and various R packages can be employed in our social work measurement work.