1. Classical Test Theory
In this lecture I explain and illustrate concepts from measurement
theory, with the focus being on reliability. Classical test theory (CTT)
has been the foundation of psychological measurement for many years.
Although measurement theory has evolved over the years to include
Generalizability Theory and Item Response Theory, I will limit my
discussion to CTT as it dovetails nicely with the basic concepts found
in confirmatory factor analysis (CFA). Much of the introduction is
borrowed from a paper I published in 2013. I will use the nouns
instrument, measure, and scale interchangeably to refer to the means by
which we obtain data from people. I created this document in R
Markdown.
1.1 The Beginning
One of the most fundamental tenets of
measurement theory was first proposed by Charles Spearman in 1904 while
he was working to develop a means of measuring individual differences in
intelligence, a stable person characteristic, or trait. Spearman
conceived of every measurement, or observed score, \(X_i\), as consisting of two components, a
true score on the construct of interest, \(T_i\), and an error score, \(e_i\), so we can write:
\[
X_i = T_i + e_i \tag{1}
\]
Because they contain error, observed values of \(X_i\) are considered fallible. The true
score, \(T_i\), is the score that would
be obtained under ideal or perfect conditions of measurement. Spearman’s
formulation has become known as true-score theory or classical test
theory (CTT). Because both \(T_i\) and
\(e_i\) are unknowns, this formula
cannot be used to estimate the measurement error in the observed scores
without further assumptions.
Assumption 1: for an individual, the
construct being measured is constant (over some specified time period)
and the errors in measurement are random. This suggests that if an
individual were to be measured an infinite number of times, a series of
\(X_i\) values would result, each consisting of the same true score
but differing because of different error scores. Being random, the error
scores have an expected value of zero, \(E(e_i) = 0\).
Assumption 2: given assumption 1, the true score is
equal to the expected value of the observed scores over an infinite
number of repeated measurements (made under similar conditions), \(T_i = E(X_i)\).
Assumption 3: observed differences among individuals may be due to
differences in their true scores or to differences in their error scores.
Because random errors are uncorrelated with true scores, this
implies that the variance of observed scores is a composite of the
variance of true scores and the variance of error scores:
\[
\sigma^2_X = \sigma^2_T + \sigma^2_e \tag{2}
\] and a little algebra shows that \[
\frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_e}{\sigma^2_X}
\tag{3}
\]
This last piece, the extent to which a set of measurements is free
from random error variance, is reliability. As a
proportion, reliability can range from 0 to 1; it equals 1 when all the
observed variance in a set of measurements is due only to true-score
variance, that is, when there are no random errors of measurement, and
it equals 0 when all the observed variance is due to random error
variance. This definition implies that some measurement errors can be
random while others are systematic. When measurement error is
systematic, it is referred to as bias. In the remainder of this lecture
we focus on random measurement error.
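A small simulation can make equations 1 through 4 concrete. The sketch below is not part of the original analysis: the sample size, the true-score mean and SD, and the error SD are all assumed values, chosen so that the population reliability is 100/(100 + 25) = .80.

```r
# Simulated illustration of X = T + e (all numeric settings are assumptions)
set.seed(123)
n      <- 100000                          # hypothetical number of people
true_t <- rnorm(n, mean = 50, sd = 10)    # true scores, variance = 100
err    <- rnorm(n, mean = 0,  sd = 5)     # random errors, variance = 25
x_obs  <- true_t + err                    # observed scores (equation 1)

# Equation 2: observed variance is approximately true + error variance
var_parts <- c(observed = var(x_obs),
               true_plus_error = var(true_t) + var(err))

# Equations 3 and 4: reliability as a variance ratio and as a squared correlation
rel_ratio <- var(true_t) / var(x_obs)
rel_r2    <- cor(true_t, x_obs)^2

print(round(var_parts, 2))
print(round(c(ratio = rel_ratio, r_squared = rel_r2), 3))
```

Both estimates should land close to .80. With real data the true scores are never observed, which is why reliability has to be estimated indirectly.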
1.2 Covariance and Correlation
Variance is a summary statistic that indicates the variation in a set
of scores from their mean (and standard deviation is the square root of
variance). Covariance indicates the covariation of two sets of scores
from their respective means. Covariance and correlation have been, and
continue to be, extremely useful in measurement theory. To briefly
illustrate covariance assume that we have measured three people on two
variables x and y; their x scores are: 2, 4,
and 6, and their y scores are: 7, 5, and 9. The means are
readily calculated as 4 for x and 7 for y. If we
express each person’s scores as a deviation from the respective means we
get -2, 0, 2 for x and 0, -2, 2 for y, respectively.
If we multiply the pairs of deviation scores, total them up (-2)(0) +
(0)(-2) + (2)(2) = 4, and divide by the number of people contributing
scores, 4/3, we obtain covariance, 1.33. Covariance can be either
positive or negative, can range from negative to positive infinity, and
its magnitude is dependent on the metric(s) of the two variables
involved. Therefore it is difficult to know whether 1.33 represents a
lot or a little covariation.
A more convenient metric for expressing
covariation is the correlation coefficient \(r\). The correlation coefficient has the
advantage that it is standardized; it is equal to 1.0 when there is
perfect positive covariation between the two sets of scores, -1.0 when
there is perfect negative covariation between the two sets of scores,
and equal to 0.0 when there is no covariation. If a covariance is
divided by the product of the standard deviations for x and
y, we obtain \(r(x, y)\), the
correlation between x and y. In the current example
\(r(x, y) = 1.33/(1.63 \times 1.63) = .50\).
For this reason, we often refer to a correlation as a standardized
covariance. Another useful property of the correlation coefficient
is that its square represents the proportion of variance shared between
x and y. In this example \(r(x, y)^2 = .25\), so we may state that 25%
of the variance in x is shared with y, and 25% of the
variance in y is shared with x.
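The arithmetic in this section can be checked in a few lines of R. This is a minimal sketch using the three pairs of scores from the example; the helper name sd_n is my own. Note that the text divides by the number of people (N), whereas base R's cov() and sd() divide by N - 1, so the covariance differs but the correlation does not.

```r
# Scores for the three people in the example
x <- c(2, 4, 6)
y <- c(7, 5, 9)

dev_x <- x - mean(x)    # deviations from the mean of x: -2, 0, 2
dev_y <- y - mean(y)    # deviations from the mean of y:  0, -2, 2

# Covariance as computed in the text: sum of cross-products divided by N
cov_n <- sum(dev_x * dev_y) / length(x)

# Standard deviation on the same divide-by-N basis (hypothetical helper)
sd_n <- function(v) sqrt(mean((v - mean(v))^2))

# Correlation = covariance / (product of the standard deviations)
r_xy <- cov_n / (sd_n(x) * sd_n(y))

print(c(cov_n = cov_n, sd_x = sd_n(x), sd_y = sd_n(y), r = r_xy, r_sq = r_xy^2))

# Base R divides by N - 1, so cov() differs from cov_n, but cor() agrees with r_xy
print(c(cov_base = cov(x, y), cor_base = cor(x, y)))
```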
1.3 Reliability Defined
According to CTT, reliability is the proportion
of variance in a set of measurements associated with true-score
variance. In other words, if we could correlate true scores with observed
scores and square the result, \(r(T_i,
X_i)^2\), this would indicate the proportion of variance in the
observed scores that is shared with the true scores. We can now write a
formal definition:
\[
\text{reliability} = r(T_i, X_i)^2 = \frac{\sigma^2_T}{\sigma^2_X} = 1 -
\frac{\sigma^2_e}{\sigma^2_X} \tag{4}
\]
Recall that true scores are never known, and so we must derive a
means of estimating \(r(T_i, X_i)^2\)
from observable scores.
Measurement theorists (psychometricians) have
studied random measurement error for a long time. In two seminal papers
published in the same issue of the British Journal of
Psychology in 1910, Charles Spearman and William Brown showed that
when multiple measurements of a construct (e.g., intelligence, anxiety,
emotional well-being, etc.) are obtained from the same individual and
combined, the ratio of true-score variance to observed-score variance in
the composite score increases. In other words, one way to reduce random
measurement error is to create instruments that contain multiple
measurements and to aggregate these measurements together when defining
the observed score for an individual. Spearman and Brown also showed
that as the number of similar measurements being combined increases, so
does the reliability of the composite scores. In essence, this is why
today most psychological scales are formed by aggregating responses to
multiple items.
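The relation Spearman and Brown derived is usually written as the Spearman-Brown formula, \(\rho_k = k\rho_1 / (1 + (k - 1)\rho_1)\), where \(\rho_1\) is the reliability of a single measurement and \(k\) is the number of parallel measurements being combined. The formula is not written out in the text above, and the function name and the single-measurement reliability of .50 below are assumptions made for illustration. The sketch shows how composite reliability rises with \(k\).

```r
# Spearman-Brown formula for the reliability of a composite of k parallel
# measurements (function name and inputs are illustrative assumptions)
spearman_brown <- function(rho1, k) {
  (k * rho1) / (1 + (k - 1) * rho1)
}

k_items   <- c(1, 2, 3, 5, 10)
composite <- spearman_brown(rho1 = 0.50, k = k_items)

print(data.frame(k = k_items, reliability = round(composite, 3)))
```

The gains are largest for the first few added measurements and then taper off, which is one reason short multi-item scales can still be quite reliable.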
1.4 Cronbach’s Coefficient \(\alpha\)
One way of estimating reliability from observed scores on a
multi-item scale is Cronbach’s \(\alpha\). Cronbach developed this
coefficient in 1951, before CFA existed. If we consider a composite score
made up of \(k\) items, then \(\alpha\) is defined as the ratio of
inter-item covariance to composite-score variance weighted by \(k/(k - 1)\).
The weight, \(k/(k - 1)\), corrects for the proportion of
variance in any item which is due to the same elements as the
covariance. In other words, \(\alpha\)
is an estimate of the proportion of true-score variance in the
multi-item scale scores. As a proportion it normally ranges from 0.0 to 1.0
(a negative value can occur if the items covary negatively on average),
and will only be equal to 1.0 when all the variance in observed scores
is true-score variance. Cronbach’s \(\alpha\) is easily calculated from the item
covariance matrix \(\mathbf C\).
\[\mathbf C =
\begin{bmatrix}
\sigma^2_{1,1} & \sigma_{1,2} & \cdots & \sigma_{1,k} \\
\sigma_{2,1} & \sigma^2_{2,2} & \cdots & \sigma_{2,k} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{k,1} & \sigma_{k,2} & \cdots & \sigma^2_{k,k}
\end{bmatrix}\]
The diagonal elements are item variances and the off-diagonal
elements are covariances. The ratio of inter-item covariance to
composite-score variance is equal to 1 minus the ratio of the sum of the
item variances (diagonal elements of \(\mathbf
C\)) to the sum of all the elements in \(\mathbf C\). This ratio is first
calculated, then subtracted from 1, and the result is finally multiplied by \(k/(k - 1)\).
\[
\alpha =\frac{k}{k - 1} \left[1-\frac{\sum_{i=1}^k \sigma^2_{ii}}
{\sum_{i=1}^k \sigma^2_{ii} + 2\sum_{i<j}^k\sigma_{ij}} \right]
\tag{5}
\] Although Cronbach’s \(\alpha\) is by far the most commonly
reported coefficient of reliability, there are others. One of these is
\(\omega\) which can be calculated from
CFA parameter estimates. More on this below.
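Equation 5 translates almost directly into R, because the sum of all elements of \(\mathbf C\) already equals the sum of the item variances plus twice the sum of the covariances. The following is a minimal sketch; the function name and the 3 x 3 covariance matrix (variances of 1, covariances of .5) are hypothetical values chosen to make the arithmetic easy to follow.

```r
# Cronbach's alpha from an item covariance matrix (equation 5)
cronbach_alpha <- function(C) {
  k <- nrow(C)                                  # number of items
  (k / (k - 1)) * (1 - sum(diag(C)) / sum(C))   # sum(C) = variances + 2 * covariances
}

# Hypothetical 3-item covariance matrix: variances of 1, covariances of .5
C_demo <- matrix(0.5, nrow = 3, ncol = 3)
diag(C_demo) <- 1

alpha_demo <- cronbach_alpha(C_demo)
print(alpha_demo)
```

Here the item variances sum to 3 and all nine elements sum to 6, so \(\alpha = (3/2)(1 - 3/6) = .75\).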
2. The CFA Model
CFA is a type of structural equation model (SEM) that focuses on the
relations between observed variables and the latent variables (i.e.,
factors) hypothesized to underlie them. I will not go into the math
here. CFA models can be used to estimate reliability. When working with SEM
we often rely on diagrams for presenting our hypothesized model and
results. These path diagrams represent observed variables as squares,
latent variables as circles, and causal influence with arrows. Figure 1
is a very basic path diagram illustrating CTT; the variance in the
observed variable X is the sum of unobserved True-score
variance and error variance.
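# Figure 1: a one-indicator model illustrating CTT (X = T + e)
lx <- matrix(1, 1, 1)      # Lambda-X: loading of X on T
td <- diag(1, nrow(lx))    # Theta-delta: error variance of X
ph <- matrix(1, 1, 1)      # Phi: variance of the latent true score
mod0 <- semPlot::lisrelModel(LX = lx, TD = td, PH = ph)
semPlot::semPaths(mod0, as.expression = c("nodes", "edges"), sizeMan = 8,
                  sizeLat = 10, style = "lisrel", rotation = 2,
                  residScale = 15, edge.color = "black",
                  edge.label.cex = 1.5, label.cex = 1.5,
                  nodeLabels = c("X", "T"), edgeLabels = c("", "e"))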
Before we consider reliability and CFA models, we need a bit more theory
for dealing with several correlations at the same time.
2.1 Path Analysis Theory
Path Analysis (PA) is a method for decomposing the correlation
coefficients within a system of causally related variables. PA applies
multiple linear regression and was developed by zoologist Sewall Wright
in 1934. Path coefficients are standardized regression coefficients
obtained from a set of inter-related linear regression models. PA
allows/requires the investigator to theorize about the causal relations
among a set of variables and apply his/her thinking to decompose the
correlations among the variables.
To illustrate, consider two variables \(x\) and \(y\) that are correlated. We may theorize
about why they are correlated using PA. They could correlate
because \(x\) causes \(y\) \((x
\rightarrow y)\), or because \(y\) causes \(x\) \((y
\rightarrow x)\), or because \(x\) and \(y\) share a common cause, \(z\), \((x
\leftarrow z \rightarrow y)\). The first two “theories” are
referred to as direct effects; the third is called a spurious
effect.
PA makes certain assumptions:
1. The relations among the variables are linear, additive, and causal.
2. The residual terms are uncorrelated. That is, each residual is not
correlated with the variables that precede it in the model.
3. There is a one-way causal flow in the system. Reciprocal causation
between variables (i.e., feedback loops) is ruled out.
4. The variables are measured on an interval-level scale.
5. The variables are measured without random error.
These assumptions provide part of the framework for applying
structural equation models. We should keep them in mind.
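The spurious-effect case \((x \leftarrow z \rightarrow y)\) is the one that matters most for what follows, so here is a simulated sketch of it. The path values (.8 and .6), the sample size, and the seed are assumptions of mine. Under the PA assumptions listed above, the correlation between \(x\) and \(y\) should be close to the product of the two standardized paths, and it should vanish once \(z\) is held constant.

```r
# Simulating a common cause z -> x and z -> y (all values are assumptions)
set.seed(2024)
n <- 100000
a <- 0.8    # standardized path from z to x
b <- 0.6    # standardized path from z to y

z <- rnorm(n)
x <- a * z + sqrt(1 - a^2) * rnorm(n)   # residual scaled so var(x) is near 1
y <- b * z + sqrt(1 - b^2) * rnorm(n)   # residuals of x and y are independent

r_xy       <- cor(x, y)
implied_xy <- a * b                     # product of the two paths

print(round(c(observed = r_xy, implied = implied_xy), 3))

# Holding z constant removes the association: correlate the residuals
# after regressing each variable on z
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
print(round(r_partial, 3))
```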
2.2 CTT & CFA
CFA estimates the correlations among observed variables in terms of
underlying latent variables. This is convenient because we can now
represent our CTT-based measurement model in mathematical terms (as a
CFA model) and test it on data. Consider a generic case. We want to
measure individual differences in some construct (anxiety for example).
We know CTT and so we create multiple (three) statements (survey items)
that on the face of it capture the construct of interest. Several
individuals then read the statements and indicate how much they
agree or disagree with each. The data can be summarized in a 3x3
covariance matrix.
CTT leads us to hypothesize that the variances
of the three items are influenced by one underlying latent variable
representing our construct. We can map this hypothesis into a CFA model
where the three items \(x_1\), \(x_2\), and \(x_3\) are indicators of one underlying
latent variable, \(\xi_1\). Our model
is shown as a path diagram in Figure 2.
Specifying a generic CFA model using LISREL notation:
# one factor with three indicators
# Lambda-X matrix (3 x 1 factor loadings):
lambdax <- matrix(0, 3, 1)
lambdax[1:3, 1] <- 1
# Phi matrix (variance of the latent variable):
LatX <- matrix(1, 1, 1)
# Theta-delta matrix (measurement error variances):
thd <- diag(1, nrow(lambdax))
# Combine the matrices into a model:
mod1 <- semPlot::lisrelModel(LX = lambdax, TD = thd, PH = LatX)
Plotting CFA path diagram:
# Plot the path diagram using semPlot::semPaths
semPlot::semPaths(mod1, as.expression = c("nodes", "edges"), sizeMan = 9,
                  sizeLat = 12, style = "lisrel", rotation = 3,
                  residScale = 14, edge.color = "black",
                  edge.label.cex = 1.5, label.cex = 1.5)
Figure 2 is the graphical representation of
our measurement model using LISREL notation. Squares denote observed
variables (indicators) and the circle represents the latent variable or
construct of interest. Each indicator has a coefficient \(\lambda_{i1}\) (lambda) linking it to the
latent variable \(\xi_1\) representing
our construct and a coefficient \(\theta_{ii}\) (theta) expressing its
measurement error variance. PA tells us that two variables may be
correlated because they share a common cause. For example, because \(x_1\) and \(x_2\) share a common cause \(\xi_1\), our model implies that \(r(x_1, x_2) = \lambda_{11}\lambda_{21}\).
The same can be said about the other two correlations, \(r(x_1,x_3) = \lambda_{11}\lambda_{31}\) and
\(r(x_2,x_3) =
\lambda_{21}\lambda_{31}\). If the estimated correlations from
our CFA model are consistent with the observed correlations, there is
support for the hypothesis that the three items are measuring the same
underlying construct.
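The model-implied correlations for a one-factor model can be generated all at once as the outer product of the loading vector with itself. The loadings below (.8, .7, .6) are hypothetical values, not estimates from any data set; this is only a sketch of the logic.

```r
# Hypothetical standardized loadings for x1, x2, x3 on a single factor
lambda <- c(x1 = 0.8, x2 = 0.7, x3 = 0.6)

# Model-implied correlations: each off-diagonal element is a product of two loadings
implied <- lambda %o% lambda   # outer product, a 3 x 3 matrix
diag(implied) <- 1             # standardized variables have unit variance

print(round(implied, 2))
```

For these loadings the implied correlations are .56, .48, and .42. In a real analysis these implied values are compared with the observed correlations to judge how well the one-factor hypothesis holds.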
2.3 Reliability Coefficient \(\omega\)
One way to interpret the \(\lambda_{ij}\) coefficients is to think of
them as correlations between the indicator variables and the latent
variable. Squaring them tells us the proportion of variance shared
between the item and the latent construct it is hypothesized to measure.
In the CTT framework this is the proportion of true-score variance in
the observed variable, so larger values (approaching 1.0) are ideal. We
can use the parameters from our CFA model to estimate the reliability of
our multi-item scale using another index \(\omega\) (omega) which is defined as:
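\[
\omega = \frac{\left(\sum^k_{i=1}\lambda_i\right)^2}
{\left(\sum^k_{i=1}\lambda_i\right)^2 + \sum^k_{i=1}\theta_{ii}} \tag{6}
\] Note that the loadings are summed before they are squared, and that the
variance of the latent variable is fixed at 1. The numerator represents
true-score variance in the composite and the denominator represents
observed variance in the composite. Equation 6 can be applied to
standardized coefficients (as is done here) and to unstandardized
coefficients.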
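As a minimal sketch of how \(\omega\) is computed for a single factor, the function below squares the sum of the loadings for the numerator and adds the summed error variances for the denominator, which is the form that reproduces the value semTools reports in Section 3.4. It assumes the factor variance is fixed at 1 and the errors are uncorrelated. The function name and the loadings (.8, .7, .6) are hypothetical.

```r
# Omega for a single-factor model from loadings and error variances
# (assumes factor variance = 1 and uncorrelated errors)
omega_coef <- function(lambda, theta) {
  true_var <- sum(lambda)^2             # composite variance due to the factor
  true_var / (true_var + sum(theta))    # divided by total composite variance
}

lambda_demo <- c(0.8, 0.7, 0.6)         # hypothetical standardized loadings
theta_demo  <- 1 - lambda_demo^2        # standardized error variances

omega_demo <- omega_coef(lambda_demo, theta_demo)
print(round(omega_demo, 3))
```

With these values the numerator is \(2.1^2 = 4.41\) and the error variances sum to 1.51, so \(\omega = 4.41/5.92\), or about .745.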
2.4 CFA using lavaan
Specifying models in lavaan is pretty straightforward.
There are different operators for specifying various relations among
variables. Latent variables are defined using the =~
operator with the latent variable on the left and the indicator
variables on the right. In the example below I show the R code for
specifying and running a very simple CFA model in lavaan.
One nice feature is that we don’t always need the raw data file; we can
work from the covariance matrix, or correlation matrix plus a vector of
standard deviations and the sample size.
3. An Example
This example involves three items hypothesized to measure anxiety
related to COVID-19:
ax1: I am afraid of COVID-19.
ax2: Thinking about COVID-19 makes me feel threatened.
ax3: I am stressed around people because I worry I’ll catch
COVID-19.
The response options were: 1 ‘Strongly Disagree’, 2 ‘Disagree’, 3
‘Neither Agree nor Disagree’, 4 ‘Agree’, and 5 ‘Strongly Agree’. These
items were included in a survey of faculty, staff, and students
(N = 38,921) at a major university in the Southeast United
States during the first few weeks of the pandemic in 2020.
Reading in the data:
# reading in corr matrix
corr <- '
1.000
.728 1.000
.714 .713 1.000 '
# add variable names and convert to full correlation matrix
corrfull <- lavaan::getCov(corr, names=c("ax1","ax2","ax3"))
# add SDs and convert to full covariance matrix
covfull <- lavaan::cor2cov(corrfull, sds=c(1.163, 1.127, 1.169))
# covfull is the observed covariance matrix (N = 38921)
# vector of item means, in case they are needed (scores range from 1 to 5)
Mns <- c(3.26, 2.76, 3.00)
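For readers who want to see what the correlation-to-covariance conversion does, here is a base-R sketch of the same calculation using the correlations and SDs given above. Each covariance is the correlation multiplied by the two standard deviations, which in matrix form is \(\mathbf D \mathbf R \mathbf D\) with \(\mathbf D\) a diagonal matrix of SDs. The object names are mine, and I would expect the result to agree with the matrix returned by lavaan::cor2cov.

```r
# Rebuilding the covariance matrix by hand from the correlations and SDs
item_names <- c("ax1", "ax2", "ax3")
R <- matrix(c(1.000, 0.728, 0.714,
              0.728, 1.000, 0.713,
              0.714, 0.713, 1.000),
            nrow = 3, byrow = TRUE,
            dimnames = list(item_names, item_names))
sds <- c(1.163, 1.127, 1.169)

# cov(i, j) = r(i, j) * sd_i * sd_j, i.e. D %*% R %*% D with D = diag(sds)
cov_by_hand <- diag(sds) %*% R %*% diag(sds)
dimnames(cov_by_hand) <- dimnames(R)

print(round(cov_by_hand, 3))
```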
3.1 Correlation Table
Using sjPlot::tab_corr to create this nice table.
# Uses sjPlot to print a nice looking correlation table
sjPlot::tab_corr(corrfull, na.deletion = "listwise", digits = 3, triangle = "lower",
title = "Correlations among observed variables (N =38,921)",
string.diag=c('1.000','1.000','1.000'))
Correlations among observed variables (N = 38,921)
|     | ax1   | ax2   | ax3   |
| ax1 | 1.000 |       |       |
| ax2 | 0.728 | 1.000 |       |
| ax3 | 0.714 | 0.713 | 1.000 |
Computed correlation used pearson-method with listwise-deletion.
CTT posits that because these items are measurements of the same
construct, responses to them should be correlated. Notice that the
correlations range from .713 to .728. Consider the correlation between
the first two items, \(r(ax1, ax2) =
.728\). If we square this correlation we get the proportion of
variance that these two items share, \(r^2(ax1,
ax2) = .728^2 = .530\).
3.2 CFA of COVID Anxiety Scale
Specify the model:
# specify the CFA model
model1 <- ' COVID_Axty =~ ax1 + ax2 + ax3 '
Running the model:
# fit the CFA model (std.lv=T fixes the latent variance at 1)
fit <- lavaan::cfa(model1, sample.cov=covfull, sample.nobs=38921, std.lv=T)
3.3 Results
Plotting results as path diagram:
# making path CFA diagram
fig3 <- semPlot::semPaths(fit, whatLabels = "std", layout = "tree2",
rotation = 1, style = "lisrel", optimizeLatRes = TRUE, nDigits=3,
structural = FALSE, layoutSplit = FALSE,
intercepts = FALSE, residuals = T,
curve = 3, curvature = 3, nCharNodes = 8,
sizeLat = 24, sizeLat2 = 12, sizeMan = 9,
edge.label.cex = 1.5, label.cex=1.0, residScale=14,
edge.color = "black", edge.label.position = .40, DoNotPlot=T)
#--working with semptools to modify path diagrams
fig3_1 <- fig3 |> semptools::mark_sig(fit, alpha = c("(n.s.)" = 1.00, "*" = .05)) #adding *, ns
plot(fig3_1)
Let’s examine the parameter estimates from our model shown in Figure 3.
The \(\lambda\) for ax1 is
.854 and the \(\lambda\) for
ax2 is .853. These are the estimated correlations of these
items with the underlying construct COVID Anxiety. If we think about the
notion of spurious effects from PA, we can say that the reason responses
to these two items are so highly correlated is because they are both
being strongly influenced by the same underlying construct,
COVID-Anxiety. If we multiply these two \(\lambda\)s together we get .728 which is
the observed correlation \(r(ax1,ax2)\)
shown in the correlation table above. The same can be done to reproduce
the other two correlations.
The three \(\theta\)s at the bottom of
Figure 3 show the estimated proportion of measurement error variance in
each item. If considered separately, rather than as a composite, roughly
27% to 30% of the variance in each item is measurement error. Squaring the
\(\lambda\)s gives the proportion of
true-score variance being “captured” by each item. Note that because all
coefficients are standardized we have \(\lambda^2 + \theta = 1.0\) for each
item.
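These claims are easy to verify by hand. The sketch below copies the standardized estimates from the Std.all column of the lavaan output in this section and the observed correlations from the table in Section 3.1; only the object names are mine. Small discrepancies in the third decimal place are rounding error in the printed estimates.

```r
# Standardized estimates copied from the lavaan output (Std.all column)
lambda <- c(ax1 = 0.854, ax2 = 0.853, ax3 = 0.836)
theta  <- c(ax1 = 0.271, ax2 = 0.273, ax3 = 0.301)

# Model-implied correlations are products of pairs of loadings
implied_r <- c(ax1_ax2 = lambda[["ax1"]] * lambda[["ax2"]],
               ax1_ax3 = lambda[["ax1"]] * lambda[["ax3"]],
               ax2_ax3 = lambda[["ax2"]] * lambda[["ax3"]])
observed_r <- c(ax1_ax2 = 0.728, ax1_ax3 = 0.714, ax2_ax3 = 0.713)

print(round(rbind(implied = implied_r, observed = observed_r), 3))

# Each standardized item variance splits into true-score and error parts
print(round(lambda^2 + theta, 3))
```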
Full output:
I used the summary() function to
write lavaan results to the console. There is a lot of
output to review. The first bit is model fit statistics. Because this
model is just-identified (there are as many parameters as there are data
elements in the covariance matrix) the degrees of freedom are 0 and the
model fit is perfect. Next we see Parameter Estimates
beginning with the Latent Variables section that shows the factor
loadings \(\lambda_{ij}\). The column
labeled Estimate gives the unstandardized coefficients, and the
column Std.all gives the completely standardized coefficients
that are shown in Figure 3. Next come the Variances. Variables prefixed
with a “.” indicate residual terms so these are the \(\theta_{ii}\) estimates. Finally, we see
the \(R^2\) values indicating the
proportion of variance that each item shares with the latent
variable.
summary(fit, fit.measures=T, standardized=T, rsquare=T)
## lavaan 0.6.15 ended normally after 12 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 6
##
## Number of observations 38921
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Model Test Baseline Model:
##
## Test statistic 64012.579
## Degrees of freedom 3
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 1.000
## Tucker-Lewis Index (TLI) 1.000
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -150279.741
## Loglikelihood unrestricted model (H1) -150279.741
##
## Akaike (AIC) 300571.482
## Bayesian (BIC) 300622.898
## Sample-size adjusted Bayesian (SABIC) 300603.830
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.000
## 90 Percent confidence interval - lower 0.000
## 90 Percent confidence interval - upper 0.000
## P-value H_0: RMSEA <= 0.050 NA
## P-value H_0: RMSEA >= 0.080 NA
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## COVID_Axty =~
## ax1 0.993 0.005 198.586 0.000 0.993 0.854
## ax2 0.961 0.005 198.199 0.000 0.961 0.853
## ax3 0.978 0.005 192.966 0.000 0.978 0.836
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .ax1 0.367 0.004 85.404 0.000 0.367 0.271
## .ax2 0.347 0.004 85.954 0.000 0.347 0.273
## .ax3 0.411 0.004 93.077 0.000 0.411 0.301
## COVID_Axty 1.000 1.000 1.000
##
## R-Square:
## Estimate
## ax1 0.729
## ax2 0.727
## ax3 0.699
3.4 Reliability Analysis
The semTools package has a function
compRelSEM that reads the lavaan output and
returns the reliability estimates for each latent variable. Setting the
options ‘tau.eq=T’ and ‘obs.var=T’ produces the equivalent of Cronbach’s
reliability coefficient \(\alpha\)
(alpha). Setting ‘tau.eq=F’ produces the \(\omega\) (omega) coefficient of
reliability. In this example the two are quite similar.
relalpha <- semTools::compRelSEM(fit, tau.eq=T, obs.var=T)
Cronbach’s \(\alpha\) for the COVID-Anxiety
Scale (using eq. 5) is 0.8842327.
relomega <- semTools::compRelSEM(fit, tau.eq=F, obs.var=T)
The \(\omega\)
reliability coefficient for the COVID-Anxiety Scale (using eq. 6) is
0.8843122.
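Both coefficients can be approximated by hand from numbers already shown in this lecture. This base-R sketch is not how semTools computes them internally. For \(\alpha\) I apply equation 5 to the covariance matrix rebuilt from the correlations and SDs. For \(\omega\) I use the unstandardized loadings and residual variances printed by summary(), with the squared sum of the loadings in the numerator. Because the printed estimates are rounded to three decimals, the hand-computed \(\omega\) should match the semTools value only to about three decimal places.

```r
# Coefficient alpha from the observed covariance matrix (equation 5)
R <- matrix(c(1.000, 0.728, 0.714,
              0.728, 1.000, 0.713,
              0.714, 0.713, 1.000), nrow = 3, byrow = TRUE)
sds <- c(1.163, 1.127, 1.169)
C <- diag(sds) %*% R %*% diag(sds)      # covariance matrix of the three items
k <- nrow(C)
alpha_hand <- (k / (k - 1)) * (1 - sum(diag(C)) / sum(C))

# Omega from the unstandardized estimates printed by summary(fit);
# the numerator is the squared sum of the loadings (factor variance = 1)
lambda <- c(0.993, 0.961, 0.978)        # Estimate column, Latent Variables
theta  <- c(0.367, 0.347, 0.411)        # Estimate column, residual variances
omega_hand <- sum(lambda)^2 / (sum(lambda)^2 + sum(theta))

print(round(c(alpha = alpha_hand, omega = omega_hand), 4))
```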
5. References
Beckstead, J. W. (2013). On measurements and
their quality: Paper 1. Reliability - history, issues and procedures.
International Journal of Nursing Studies, 50(7),
968–973.
Brown, W. (1910). Some experimental results in the correlation of
mental abilities. British Journal of Psychology, 3,
296–322.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of
tests. Psychometrika, 16(3), 297–334.
Spearman, C. (1904). General intelligence: objectively determined and
measured. American Journal of Psychology, 15, 201–293.
Spearman, C. (1910). Correlation calculated with faulty data.
British Journal of Psychology, 3, 271–295.
Wright, S. (1934). The method of path coefficients. Annals of
Mathematical Statistics, 5, 161–215.
