1. Classical Test Theory

In this lecture I explain and illustrate concepts from measurement theory, with the focus being on reliability. Classical test theory (CTT) has been the foundation of psychological measurement for many years. Although measurement theory has evolved over the years to include Generalizability Theory and Item Response Theory, I will limit my discussion to CTT as it dovetails nicely with the basic concepts found in confirmatory factor analysis (CFA). Much of the introduction is borrowed from a paper I published in 2013. I will use the nouns instrument, measure, and scale interchangeably to refer to the means by which we obtain data from people. I created this document in R Markdown.

1.1 The Beginning

One of the most fundamental tenets of measurement theory was first proposed by Charles Spearman in 1904 while he was working to develop a means of measuring individual differences in intelligence, a stable person characteristic, or trait. Spearman conceived of every measurement, or observed score, \(X_i\), as consisting of two components, a true score on the construct of interest, \(T_i\), and an error score, \(e_i\), so we can write:

\[ X_i = T_i + e_i \tag{1} \]

Because they contain error, observed values of \(X_i\) are considered fallible. The true score, \(T_i\), is the score that would be obtained under ideal or perfect conditions of measurement. Spearman’s formulation has become known as true-score theory or classical test theory (CTT). Because both \(T_i\) and \(e_i\) are unknowns, this formula cannot be used to estimate the measurement error in the observed scores without further assumptions.

Assumption 1: for an individual, the construct being measured is constant (over some specified time period) and the errors in measurement are random. This suggests that if an individual were to be measured an infinite number of times a series of X values would result, each consisting of the same true score but differing due to different error scores. Being random, the expected value of the error scores is zero, \(E(e_i) = 0\). Assumption 2: given assumption 1, the true score is equal to the expected value of the observed scores over an infinite number of repeated measurements (made under similar conditions), \(T_i = E(X_i)\). Assumption 3: observed differences among individuals may be due to differences in their true scores or due to differences in their error scores. This implies that the variance of observed scores is a composite of the variance of true scores and the variance of error scores:

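\[ \sigma^2_X = \sigma^2_T + \sigma^2_e \tag{2} \]

No covariance term appears because random errors are uncorrelated with true scores. Dividing both sides by \(\sigma^2_X\) and rearranging shows that

\[ \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_e}{\sigma^2_X} \tag{3} \]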

The ratio in equation 3, the extent to which a set of measurements is free from random error variance, is reliability. As a proportion, reliability can range from 0 to 1; it equals 1 when all the observed variance in a set of measurements is due only to true-score variance, that is, when there are no random errors of measurement, and it equals 0 when all the observed variance is due to random error variance. This definition implies that some measurement errors can be random while others are systematic. When measurement error is systematic, it is referred to as bias. In the remainder of this lecture we focus on random measurement error.
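
To make equations 1 to 3 concrete, here is a minimal simulation sketch in base R. The true-score and error standard deviations (10 and 5) and the sample size are assumptions chosen for illustration, not values from any real instrument; with them the expected reliability is 100/125 = .80.

```r
# Simulated CTT data: X = T + e, with illustrative (assumed) variances
set.seed(2013)
n <- 100000
true_score <- rnorm(n, mean = 50, sd = 10)  # sigma^2_T = 100
error      <- rnorm(n, mean = 0,  sd = 5)   # sigma^2_e = 25, E(e) = 0
observed   <- true_score + error            # equation 1

# Reliability as a variance ratio (equation 3, left-hand side)
reliability_ratio <- var(true_score) / var(observed)

# Reliability as the squared correlation between true and observed scores
reliability_r2 <- cor(true_score, observed)^2

round(c(ratio = reliability_ratio, r2 = reliability_r2), 3)
```

Both quantities should land close to .80. In real data only `observed` is available, which is why reliability has to be estimated indirectly.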

1.2 Covariance and Correlation

Variance is a summary statistic that indicates the variation in a set of scores from their mean (and standard deviation is the square root of variance). Covariance indicates the covariation of two sets of scores from their respective means. Covariance and correlation have been, and continue to be, extremely useful in measurement theory. To briefly illustrate covariance, assume that we have measured three people on two variables x and y; their x scores are 2, 4, and 6, and their y scores are 7, 5, and 9. The means are readily calculated as 4 for x and 7 for y. If we express each person’s scores as deviations from the respective means we get -2, 0, 2 for x and 0, -2, 2 for y. If we multiply the pairs of deviation scores, total them up, (-2)(0) + (0)(-2) + (2)(2) = 4, and divide by the number of people contributing scores, 4/3, we obtain the covariance, 1.33. (Dividing by N - 1 instead of N gives the sample estimate, 4/2 = 2.0; the correlation computed below is the same either way.) Covariance can be either positive or negative, can range from negative to positive infinity, and its magnitude is dependent on the metric(s) of the two variables involved. Therefore it is difficult to know whether 1.33 represents a lot or a little covariation.

A more convenient metric for expressing covariation is the correlation coefficient \(r\). The correlation coefficient has the advantage that it is standardized; it is equal to 1.0 when there is perfect positive covariation between the two sets of scores, -1.0 when there is perfect negative covariation between the two sets of scores, and equal to 0.0 when there is no covariation. If a covariance is divided by the product of the standard deviations for x and y, we obtain \(r(x, y)\), the correlation between x and y. In the current example both standard deviations are 1.63, so \(r(x, y) = 1.33/[(1.63)(1.63)] = .50\). For this reason, we often refer to a correlation as a standardized covariance. Another useful property of the correlation coefficient is that its square represents the proportion of variance shared between x and y. In this example \(r(x, y)^2 = .25\), so we may state that 25% of the variance in x is shared with y, and 25% of the variance in y is shared with x.
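
The arithmetic in this section can be checked with a few lines of base R. This sketch uses the three pairs of scores from the example and divides by N, as the text does; the variable names are my own.

```r
# Scores for three people on x and y (from the worked example)
x <- c(2, 4, 6)
y <- c(7, 5, 9)
n <- length(x)

# Deviations from the respective means
dev_x <- x - mean(x)   # -2, 0, 2
dev_y <- y - mean(y)   #  0, -2, 2

# Covariance and standard deviations using N (not N - 1) as the divisor
cov_xy <- sum(dev_x * dev_y) / n
sd_x   <- sqrt(sum(dev_x^2) / n)
sd_y   <- sqrt(sum(dev_y^2) / n)

# Correlation as a standardized covariance, and the shared variance
r_xy  <- cov_xy / (sd_x * sd_y)
r2_xy <- r_xy^2

c(cov = cov_xy, sd_x = sd_x, sd_y = sd_y, r = r_xy, r2 = r2_xy)
```

Base R's `cov()` and `sd()` divide by N - 1, so they return different values for the covariance and standard deviations, but `cor(x, y)` gives the same correlation because the divisor cancels.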

1.3 Reliability Defined

According to CTT, reliability is the proportion of variance in a set of measurements associated with true-score variance. In other words, if we could correlate true scores with observed scores and square the result, \(r(T_i, X_i)^2\), this would indicate the proportion of variance in the observed scores that is shared with the true scores. We can now write a formal definition:

\[ reliability = r(T_i, X_i)^2 = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_e}{\sigma^2_X} \tag{4} \]

Recall that true scores are never known, and so we must derive a means of estimating \(r(T_i, X_i)^2\) from observable scores.

Measurement theorists (psychometricians) have studied random measurement error for a long time. In two seminal papers published in the same issue of the British Journal of Psychology in 1910, Charles Spearman and William Brown showed that when multiple measurements of a construct (e.g., intelligence, anxiety, emotional well-being, etc.) are obtained from the same individual and combined, the ratio of true-score variance to observed-score variance in the composite score increases. In other words, one way to reduce random measurement error is to create instruments that contain multiple measurements and to aggregate these measurements together when defining the observed score for an individual. Spearman and Brown also showed that as the number of similar measurements being combined increases, so does the reliability of the composite scores. In essence, this is why today most psychological scales are formed by aggregating responses to multiple items.
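
The result described above is usually written as the Spearman-Brown formula, which the lecture does not spell out: if a single measurement has reliability \(\rho\), a composite of \(k\) parallel measurements has reliability \(k\rho / [1 + (k - 1)\rho]\). The sketch below tabulates it; the function name and the starting reliability of .50 are assumptions for illustration.

```r
# Spearman-Brown: reliability of a composite of k parallel measurements,
# each with single-measurement reliability rel_single
spearman_brown <- function(rel_single, k) {
  (k * rel_single) / (1 + (k - 1) * rel_single)
}

# Hypothetical single-item reliability of .50, lengthened to k items
k_items <- c(1, 2, 3, 5, 10)
composite_rel <- spearman_brown(0.50, k_items)
data.frame(k = k_items, reliability = round(composite_rel, 3))
```

Reliability rises with every added item, but the gains shrink as \(k\) grows.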

1.4 Cronbach’s Coefficient \(\alpha\)

One way of estimating reliability from observed scores on a multi-item scale is Cronbach’s \(\alpha\). Cronbach developed this coefficient in 1951, before CFA existed. If we consider a composite score made up of \(k\) items, then \(\alpha\) is defined as the ratio of inter-item covariance to composite-score variance weighted by \(k/(k - 1)\). The weight, \(k/(k - 1)\), corrects for the proportion of variance in any item which is due to the same elements as the covariance. In other words, \(\alpha\) is an estimate of the proportion of true-score variance in the multi-item scale scores. As an estimate of a proportion it normally falls between 0.0 and 1.0 (it can dip below 0 when the items covary negatively on average), and it will only be equal to 1.0 when all the variance in observed scores is true-score variance. Cronbach’s \(\alpha\) is easily calculated from the item covariance matrix \(\mathbf C\).

\[\mathbf C = \begin{bmatrix} \sigma^2_{1,1} & \sigma_{1,2} & \cdots & \sigma_{1,k} \\ \sigma_{2,1} & \sigma^2_{2,2} & \cdots & \sigma_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{k,1} & \sigma_{k,2} & \cdots & \sigma^2_{k,k} \end{bmatrix}\]

The diagonal elements are item variances and the off-diagonal elements are covariances. The ratio of inter-item covariance to composite-score variance is equal to 1 minus the ratio of the sum of the item variances (diagonal elements of \(\mathbf C\)) to the sum of all the elements in \(\mathbf C\). This ratio is first calculated, then subtracted from 1, and the result is finally multiplied by \(k/(k - 1)\).

\[ \alpha =\frac{k}{k - 1} \left[1-\frac{\sum_{i=1}^k \sigma^2_{ii}} {\sum_{i=1}^k \sigma^2_{ii} + 2\sum_{i<j}^k\sigma_{ij}} \right] \tag{5} \]

Although Cronbach’s \(\alpha\) is by far the most commonly reported coefficient of reliability, there are others. One of these is \(\omega\), which can be calculated from CFA parameter estimates. More on this below.
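
Equation 5 translates almost directly into R, because the denominator is the sum of all elements of \(\mathbf C\). The sketch below is a minimal illustration; the function name `cronbach_alpha` and the demonstration matrix (three items with unit variances and covariances of .5) are hypothetical.

```r
# Cronbach's alpha from an item covariance matrix C (equation 5)
cronbach_alpha <- function(C) {
  stopifnot(is.matrix(C), nrow(C) == ncol(C))
  k <- nrow(C)
  sum_item_var  <- sum(diag(C))  # sum of the diagonal elements
  composite_var <- sum(C)        # variances plus twice the covariances
  (k / (k - 1)) * (1 - sum_item_var / composite_var)
}

# Hypothetical 3-item covariance matrix: variances 1, covariances .5
C_demo <- matrix(0.5, nrow = 3, ncol = 3)
diag(C_demo) <- 1
alpha_demo <- cronbach_alpha(C_demo)
alpha_demo
```

For this matrix the item variances sum to 3 and the whole matrix sums to 6, so \(\alpha = 1.5 \times (1 - 3/6) = .75\).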

2. The CFA Model

CFA is a type of structural equation model (SEM) that focuses on the relations between observed variables and the latent variables (i.e., factors) hypothesized to underlie them. I will not go into the math here. CFA models can be used to estimate reliability. When working with SEM we often rely on diagrams for presenting our hypothesized model and results. These path diagrams represent observed variables as squares, latent variables as circles, and causal influence with arrows. Figure 1 is a very basic path diagram illustrating CTT; the variance in the observed variable X is the sum of unobserved true-score variance and error variance.

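# LISREL matrices for one observed variable X and one latent variable T
lx <- matrix(1, 1, 1)        # Lambda-X: loading of X on T
td <- diag(1, nrow(lx))      # Theta-delta: error variance of X
ph <- matrix(1, 1, 1)        # Phi: variance of T
mod0 <- semPlot::lisrelModel(LX = lx, TD = td, PH = ph)
# Draw the path diagram, labelling the nodes X and T and the residual e
semPlot::semPaths(mod0, as.expression = c("nodes", "edges"), sizeMan = 8,
         sizeLat = 10, style = "lisrel", rotation = 2,
         residScale = 15, edge.color = "black",
         edge.label.cex = 1.5, label.cex = 1.5,
         nodeLabels = c("X", "T"), edgeLabels = c("", "e"))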
Figure 1. Sources contributing to the variance of *X*. T refers to true-scores, e refers to measurement error.



Before we consider reliability and CFA models, we need a bit more theory for dealing with several correlations at the same time.

2.1 Path Analysis Theory

Path Analysis (PA) is a method for decomposing the correlation coefficients within a system of causally related variables. PA applies multiple linear regression and was developed by zoologist Sewall Wright in 1934. Path coefficients are standardized regression coefficients obtained from a set of inter-related linear regression models. PA allows/requires the investigator to theorize about the causal relations among a set of variables and apply his/her thinking to decompose the correlations among the variables.

To illustrate, consider two variables \(x\) and \(y\) that are correlated. We may theorize about why they are correlated using PA. They could correlate because \(x\) causes \(y\) \((x \rightarrow y)\), or because \(y\) causes \(x\) \((y \rightarrow x)\), or because \(x\) and \(y\) share a common cause, \(z\), \((x \leftarrow z \rightarrow y)\). The first two “theories” are referred to as direct effects; the third is called a spurious effect.
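
The spurious-effect case can be illustrated with a short simulation. This is a sketch under assumed values: the path coefficients (.8 and .6), the sample size, and the variable names are chosen for illustration. When \(z\) is the only reason \(x\) and \(y\) are related, their correlation should be close to the product of the two paths, and it should vanish once \(z\) is partialled out.

```r
# x <- z -> y: a common cause produces a correlation between x and y
set.seed(1934)
n <- 100000
a <- 0.8   # assumed standardized path z -> x
b <- 0.6   # assumed standardized path z -> y

z <- rnorm(n)
x <- a * z + sqrt(1 - a^2) * rnorm(n)  # unit variance by construction
y <- b * z + sqrt(1 - b^2) * rnorm(n)

# Observed correlation versus the product of the paths (a * b = .48)
r_xy <- cor(x, y)

# Partial correlation of x and y after removing z from each
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

round(c(r_xy = r_xy, a_times_b = a * b, r_partial = r_partial), 3)
```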

PA makes certain assumptions:

1. The relations among the variables are linear, additive, and causal.
2. The residual terms are uncorrelated. That is, each residual is not correlated with the variables that precede it in the model.
3. There is a one-way causal flow in the system. Reciprocal causation between variables (i.e., feedback loops) is ruled out.
4. The variables are measured on an interval-level scale.
5. The variables are measured without random error.

These assumptions provide part of the framework for applying structural equation models. We should keep them in mind.

2.2 CTT & CFA

CFA estimates the correlations among observed variables in terms of underlying latent variables. This is convenient because we can now represent our CTT-based measurement model in mathematical terms (as a CFA model) and test it on data. Consider a generic case. We want to measure individual differences in some construct (anxiety, for example). We know CTT and so we create multiple (three) statements (survey items) that on the face of it capture the construct of interest. Several individuals then read the statements and indicate how much they agree or disagree with each. The data can be summarized in a 3x3 covariance matrix.

CTT leads us to hypothesize that the variances of the three items are influenced by one underlying latent variable representing our construct. We can map this hypothesis into a CFA model where the three items \(x_1\), \(x_2\), and \(x_3\) are indicators of one underlying latent variable, \(\xi_1\). Our model is shown as a path diagram in Figure 2.

Specifying a generic CFA model using LISREL notation:

# creating 1 factor w 3 indicators
# Lambda matrices:
lambdax <- matrix(0,3,1)
lambdax[1:3,1] <-1
# Phi matrix:
LatX <- matrix(1,1,1) #phi
# Theta matrix:
thd <- diag(1,nrow(lambdax))
# Combine matrices into a model:
mod1 <- lisrelModel(LX=lambdax, TD=thd, PH=LatX)

Plotting CFA path diagram:

# Plot the path diagram with semPlot::semPaths

semPaths(mod1, as.expression=c("nodes","edges"), sizeMan = 9,
         sizeLat = 12, style="lisrel", rotation = 3,
         residScale=14, edge.color="black", 
         edge.label.cex = 1.5, label.cex=1.5)
Figure 2. The LISREL model for a one factor model with three indicators.


Figure 2 is the graphical representation of our measurement model using LISREL notation. Squares denote observed variables (indicators) and the circle represents the latent variable or construct of interest. Each indicator has a coefficient \(\lambda_{i1}\) (lambda) linking it to the latent variable \(\xi_1\) representing our construct and a coefficient \(\theta_{ii}\) (theta) expressing its measurement error variance. PA tells us that two variables may be correlated because they share a common cause. For example, because \(x_1\) and \(x_2\) share a common cause \(\xi_1\), our model implies that \(r(x_1,x_2) = \lambda_{11}\lambda_{21}\). The same can be said about the other two correlations, \(r(x_1,x_3) = \lambda_{11}\lambda_{31}\) and \(r(x_2,x_3) = \lambda_{21}\lambda_{31}\). If the estimated correlations from our CFA model are consistent with the observed correlations, there is support for the hypothesis that the three items are measuring the same underlying construct.
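
The model-implied correlations can be computed in one step from the LISREL matrices as \(\Lambda \Phi \Lambda^\top + \Theta\). The sketch below uses hypothetical standardized loadings (.8, .7, .6), not estimates from any data, and sets each error variance to \(1 - \lambda^2\) so that the implied variances equal 1.

```r
# Hypothetical standardized loadings for three indicators of one factor
lambda <- matrix(c(0.8, 0.7, 0.6), ncol = 1)   # Lambda-X (3 x 1)
phi    <- matrix(1, 1, 1)                      # factor variance fixed at 1
theta  <- diag(as.vector(1 - lambda^2))        # error variances on the diagonal

# Model-implied correlation matrix: Lambda Phi Lambda' + Theta
implied <- lambda %*% phi %*% t(lambda) + theta
dimnames(implied) <- list(c("x1", "x2", "x3"), c("x1", "x2", "x3"))
round(implied, 2)
```

Each off-diagonal element is the product of two loadings, for example \(r(x_1, x_2) = .8 \times .7 = .56\), which is exactly the path-analytic decomposition described above.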

2.3 Reliability Coefficient \(\omega\)

One way to interpret the \(\lambda_{ij}\) coefficients is to think of them as correlations between the indicator variables and the latent variable. Squaring them tells us the proportion of variance shared between the item and the latent construct it is hypothesized to measure. In the CTT framework this is the proportion of true-score variance in the observed variable, so larger values (approaching 1.0) are ideal. We can use the parameters from our CFA model to estimate the reliability of our multi-item scale using another index \(\omega\) (omega) which is defined as:

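\[ \omega = \frac{\left(\sum^k_{i=1}\lambda_i\right)^2} {\left(\sum^k_{i=1}\lambda_i\right)^2 + \sum^k_{i=1}\theta_{ii}} \tag{6} \]

The numerator represents true-score variance in the composite and the denominator represents its observed variance. Note that the loadings are summed before squaring: the true-score variance of a composite includes the covariances among the items, not just each item’s own true-score variance. Equation 6 can be applied to standardized coefficients (as is done here) and to unstandardized coefficients.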

2.4 CFA using lavaan

Specifying models in lavaan is pretty straightforward. There are different operators for specifying various relations among variables. Latent variables are defined using the =~ operator with the latent variable on the left and the indicator variables on the right. In the example below I show the R code for specifying and running a very simple CFA model in lavaan. One nice feature is that we don’t always need the raw data file; we can work from the covariance matrix, or from a correlation matrix plus a vector of standard deviations and the sample size.

3. An Example

This example involves three items hypothesized to measure anxiety related to COVID-19:

ax1: I am afraid of COVID-19.
ax2: Thinking about COVID-19 makes me feel threatened.
ax3: I am stressed around people because I worry I’ll catch COVID-19.

The response options were: 1 ‘Strongly Disagree’, 2 ‘Disagree’, 3 ‘Neither Agree nor Disagree’, 4 ‘Agree’, and 5 ‘Strongly Agree’. These items were included in a survey of faculty, staff, and students (N = 38,921) at a major university in the Southeast United States during the first few weeks of the pandemic in 2020.

Reading in the data:
# reading in corr matrix
corr <- '
1.000
.728 1.000
.714 .713 1.000 '
# add variable names and convert to full correlation matrix
corrfull <- lavaan::getCov(corr, names=c("ax1","ax2","ax3"))
# add SDs and convert to full covariance matrix
covfull <- lavaan::cor2cov(corrfull, sds=c(1.163, 1.127, 1.169))
# observed covariance matrix N=38921
# vector of means that may be needed (scores range from 1 to 5)
Mns<- c(3.26, 2.76, 3.00)

3.1 Correlation Table

Using sjPlot::tab_corr to create this nice table.

# Uses sjPlot to print a nice looking correlation table
sjPlot::tab_corr(corrfull, na.deletion = "listwise", digits = 3, triangle = "lower",
         title = "Correlations among observed variables (N =38,921)",
         string.diag=c('1.000','1.000','1.000'))
Correlations among observed variables (N =38,921)
  ax1 ax2 ax3
ax1 1.000    
ax2 0.728 1.000  
ax3 0.714 0.713 1.000
Computed correlation used pearson-method with listwise-deletion.
ax1: I am afraid of COVID-19.
ax2: Thinking about COVID-19 makes me feel threatened.
ax3: I am stressed around people because I worry I’ll catch COVID-19.


CTT posits that because these items are measurements of the same construct, responses to them should be correlated. Notice that the correlations range from .713 to .728. Consider the correlation between the first two items, \(r(ax1, ax2) = .728\). If we square this correlation we get the proportion of variance that these two items share, \(r^2(ax1, ax2) = .728^2 = .530\).

3.2 CFA of COVID Anxiety Scale

Specify the model:

# specify the CFA model
model1 <- ' COVID_Axty =~ ax1 + ax2 + ax3 '

Running the model:

# fit the CFA model to the covariance matrix; std.lv = TRUE fixes the factor variance at 1
fit <- lavaan::cfa(model1, sample.cov = covfull, sample.nobs = 38921, std.lv = TRUE)

3.3 Results

Plotting results as path diagram:

# making path CFA diagram
fig3 <- semPlot::semPaths(fit, whatLabels = "std", layout = "tree2", 
         rotation = 1, style = "lisrel", optimizeLatRes = TRUE, nDigits=3,
         structural = FALSE, layoutSplit = FALSE,
         intercepts = FALSE, residuals = T, 
         curve = 3, curvature = 3, nCharNodes = 8, 
         sizeLat = 24, sizeLat2 = 12, sizeMan = 9,
         edge.label.cex = 1.5, label.cex=1.0, residScale=14, 
         edge.color = "black", edge.label.position = .40, DoNotPlot=T)
#--working with semptools to modify path diagrams
fig3_1 <- fig3  |> semptools::mark_sig(fit, alpha = c("(n.s.)" = 1.00, "*" = .05)) #adding *, ns
plot(fig3_1)
Figure 3. COVID Anxiety Scale measurement model


ax1: I am afraid of COVID-19.
ax2: Thinking about COVID-19 makes me feel threatened.
ax3: I am stressed around people because I worry I’ll catch COVID-19.


Let’s examine the parameter estimates from our model shown in Figure 3. The \(\lambda\) for ax1 is .854 and the \(\lambda\) for ax2 is .853. These are the estimated correlations of these items with the underlying construct COVID Anxiety. If we think about the notion of spurious effects from PA, we can say that the reason responses to these two items are so highly correlated is because they are both being strongly influenced by the same underlying construct, COVID-Anxiety. If we multiply these two \(\lambda\)s together we get .728 which is the observed correlation \(r(ax1,ax2)\) shown in the correlation table above. The same can be done to reproduce the other two correlations.

The three \(\theta\)s at the bottom of Figure 3 show the estimated proportion of measurement error variance in each item. If considered separately, rather than as a composite, roughly 28% of the variance in each item is measurement error (the estimates range from 27% to 30%). Squaring the \(\lambda\)s gives the proportion of true-score variance being “captured” by each item. Note that because all coefficients are standardized we have \(\lambda^2 + \theta = 1.0\) for each item.
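
These relations can be checked directly from the standardized estimates reported for this model (the Std.all column of the lavaan output). The sketch below types those values in by hand, so agreement is only to about three decimals because the inputs are rounded.

```r
# Standardized loadings and residual variances as reported (rounded)
lambda <- c(ax1 = 0.854, ax2 = 0.853, ax3 = 0.836)
theta  <- c(ax1 = 0.271, ax2 = 0.273, ax3 = 0.301)

# Model-implied correlations are products of pairs of loadings
implied_r <- outer(lambda, lambda)
implied_pairs <- round(implied_r[lower.tri(implied_r)], 3)
names(implied_pairs) <- c("ax1-ax2", "ax1-ax3", "ax2-ax3")
implied_pairs

# Standardized variance decomposition: lambda^2 + theta should be about 1
round(lambda^2 + theta, 3)
```

The three products reproduce the observed correlations of .728, .714, and .713.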


Full output:

I used the summary() function to write lavaan results to the console. There is a lot of output to review. The first bit is model fit statistics. Because this model is just-identified (there are as many parameters as there are data elements in the covariance matrix) the degrees of freedom are 0 and the model fit is perfect. Next we see Parameter Estimates beginning with the Latent Variables section that shows the factor loadings \(\lambda_{ij}\). The column labeled Estimate gives the unstandardized coefficients, and the column Std.all gives the completely standardized coefficients that are shown in Figure 3. Next come the Variances. Variables prefixed with a “.” indicate residual terms so these are the \(\theta_{ii}\) estimates. Finally, we see the \(R^2\) values indicating the proportion of variance that each item shares with the latent variable.
summary(fit, fit.measures=T, standardized=T, rsquare=T)
## lavaan 0.6.15 ended normally after 12 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         6
## 
##   Number of observations                         38921
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Model Test Baseline Model:
## 
##   Test statistic                             64012.579
##   Degrees of freedom                                 3
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)            -150279.741
##   Loglikelihood unrestricted model (H1)    -150279.741
##                                                       
##   Akaike (AIC)                              300571.482
##   Bayesian (BIC)                            300622.898
##   Sample-size adjusted Bayesian (SABIC)     300603.830
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.000
##   P-value H_0: RMSEA <= 0.050                       NA
##   P-value H_0: RMSEA >= 0.080                       NA
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   COVID_Axty =~                                                         
##     ax1               0.993    0.005  198.586    0.000    0.993    0.854
##     ax2               0.961    0.005  198.199    0.000    0.961    0.853
##     ax3               0.978    0.005  192.966    0.000    0.978    0.836
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .ax1               0.367    0.004   85.404    0.000    0.367    0.271
##    .ax2               0.347    0.004   85.954    0.000    0.347    0.273
##    .ax3               0.411    0.004   93.077    0.000    0.411    0.301
##     COVID_Axty        1.000                               1.000    1.000
## 
## R-Square:
##                    Estimate
##     ax1               0.729
##     ax2               0.727
##     ax3               0.699
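
The zero degrees of freedom in the output above follow from a simple count. The sketch below is just that bookkeeping for a one-factor model with three indicators and the factor variance fixed at 1 (as with std.lv); the variable names are my own.

```r
# Counting information and parameters for a one-factor, three-indicator CFA
p <- 3                                  # number of observed items

# Unique elements of the covariance matrix: p variances + p(p - 1)/2 covariances
n_data_elements <- p * (p + 1) / 2

# Free parameters: one loading and one residual variance per item
# (the factor variance is fixed at 1, so it is not estimated)
n_free_params <- p + p

degrees_of_freedom <- n_data_elements - n_free_params
c(data = n_data_elements, params = n_free_params, df = degrees_of_freedom)
```

With six data elements and six free parameters the model reproduces the covariance matrix exactly, so fit statistics are uninformative here; a fourth item would give 10 - 8 = 2 degrees of freedom and a testable model.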

3.4 Reliability Analysis

The semTools package has a function, compRelSEM, that reads the lavaan output and returns the reliability estimates for each latent variable. Setting the options ‘tau.eq=T’ and ‘obs.var=T’ produces a coefficient equivalent to Cronbach’s \(\alpha\) (alpha). Setting ‘tau.eq=F’ produces the \(\omega\) (omega) coefficient of reliability. In this example the two are quite similar.

relalpha<-semTools::compRelSEM(fit, tau.eq=T, obs.var=T)
The Cronbach’s \(\alpha\) reliability for COVID-Anxiety Scale (using eq. 5) is 0.8842327.
relomega<-semTools::compRelSEM(fit, tau.eq=F, obs.var=T)
The \(\omega\) reliability coefficient for COVID-Anxiety Scale (using eq. 6) is 0.8843122.
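
Both coefficients can also be reproduced by hand in base R, which is a useful check on what compRelSEM is doing. This sketch rebuilds the covariance matrix from the correlations and standard deviations given earlier and applies equation 5. For \(\omega\) it uses the form with the squared sum of the loadings in the numerator, \((\sum\lambda)^2 / [(\sum\lambda)^2 + \sum\theta]\), with the unstandardized estimates typed in from the printed output, so that value matches only to about three decimals.

```r
# Rebuild the item covariance matrix from the correlations and SDs
R <- matrix(c(1.000, 0.728, 0.714,
              0.728, 1.000, 0.713,
              0.714, 0.713, 1.000), nrow = 3, byrow = TRUE)
sds <- c(1.163, 1.127, 1.169)
C <- diag(sds) %*% R %*% diag(sds)

# Cronbach's alpha (equation 5)
k <- nrow(C)
alpha_hand <- (k / (k - 1)) * (1 - sum(diag(C)) / sum(C))

# Omega from the unstandardized loadings and residual variances (rounded)
lambda <- c(0.993, 0.961, 0.978)
theta  <- c(0.367, 0.347, 0.411)
omega_hand <- sum(lambda)^2 / (sum(lambda)^2 + sum(theta))

round(c(alpha = alpha_hand, omega = omega_hand), 4)
```

Both values come out at about .884, in line with the semTools results.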

4. Concluding Remarks

Reliability of measurements is the extent to which they are free from random measurement error. The reliability of a set of measurements limits the degree to which they can correlate with measurements on other variables.
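
The limiting effect of reliability on correlations is usually expressed with Spearman’s attenuation formula: the observed correlation equals the correlation between true scores multiplied by \(\sqrt{r_{xx} r_{yy}}\), where \(r_{xx}\) and \(r_{yy}\) are the two reliabilities. The sketch below is illustrative only; the function names and the second scale’s reliability of .70 are hypothetical.

```r
# Expected observed correlation given a true-score correlation
# and the reliabilities of the two sets of measurements
attenuated_r <- function(true_r, rel_x, rel_y) {
  true_r * sqrt(rel_x * rel_y)
}

# Ceiling on the observed correlation (true-score correlation of 1)
max_observable_r <- function(rel_x, rel_y) {
  attenuated_r(1, rel_x, rel_y)
}

# A scale with reliability .88 paired with a hypothetical scale at .70
ceiling_r <- max_observable_r(0.88, 0.70)
round(ceiling_r, 3)
```

Even if the two constructs were perfectly related, the observed correlation between these two sets of scores could not be expected to exceed this ceiling.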

In this lecture we covered the basics of psychometric measurement theory and mapped key concepts onto the CFA model. We examined the reliability of data obtained using a three-item scale intended to measure anxiety related to COVID-19. CFA provided us with some psychometric information about these data. The correlations among the items are consistent with a single underlying latent variable representing our construct. Each item had about 28% measurement error variance. When treated as a composite scale the amount of measurement error variance decreased to about 12% because the true-score variance was estimated to be 88% (both \(\alpha\) and \(\omega\) were \(\approx .88\)).

Reliability is a property of data, not of the instruments used to obtain them; this means that it is more important to calculate and report the reliability of your own data than it is to cite previously published values of reliability.

5. References

Beckstead, J. W. (2013). On measurements and their quality: Paper 1. Reliability - history, issues and procedures. International Journal of Nursing Studies, 50(7), 968–973.

Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.

Spearman, C. (1904). General intelligence: objectively determined and measured. American Journal of Psychology, 15, 201–293.

Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.

Wright, S. (1934). The method of path coefficients. Annals of Mathematical Statistics, 5, 161–215.
---
title: "An Introduction to Psychometric Measurement Theory & Confirmatory Factor Analysis."
author: "Jason Beckstead"
date: "`r Sys.Date()`"
output: 
  html_document:
    toc: true
    toc_float:
      collapsed: false
    toc_depth: 3
    code_folding: hide
    code_download: true
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

if (!require("pacman")) install.packages("pacman")
library(pacman)
# Load installed packages, install and load the package if not installed
pacman::p_load(here, rio, conflicted, tidyverse, 
               janitor, summarytools, skimr, DataExplorer, inspectdf, Hmisc,
               lavaan, semTools, semoutput, semPlot, sjPlot, psych, readr, dplyr,
               GPArotation, psychTools, semptools,tinytex)
```

## 1. Classical Test Theory
In this lecture I explain and illustrate concepts from measurement theory, with the focus being on reliability. Classical test theory (CTT) has been the foundation of psychological measurement for many years. Although measurement theory has evolved over the years to include Generalizability Theory and Item Response Theory, I will limit my discussion to CTT as it dovetails nicely with the basic concepts found in confirmatory factor analysis (CFA). Much of the introduction is borrowed from a paper I published in 2013. I will use the nouns instrument, measure, and scale interchangeably to refer to the means by which we obtain data from people. I created this document in R Markdown.


### 1.1 The Beginning
| One of the most fundamental tenets of measurement theory was first proposed by Charles Spearman in 1904 while he was working to develop a means of measuring individual differences in intelligence, a stable person characteristic, or trait. Spearman conceived of every measurement, or observed score, $X_i$, as consisting of two components, a true score on the construct of interest, $T_i$, and an error score, $e_i$, so we can write:
$$
X_i = T_i + e_i \tag{1}
$$

Because they contain error, observed values of $X_i$ are considered fallible. The true score, $T_i$, is the score that would be obtained under ideal or perfect conditions of measurement. Spearman’s formulation has become known as true-score theory or classical test theory (CTT). Because both $T_i$ and $e_i$ are unknowns, this formula cannot be used to estimate the measurement error in the observed scores without further assumptions.
 
| *Assumption 1:* for an individual, the construct being measured is constant (over some specified time period) and the errors in measurement are random. This suggests that if an individual were to be measured an infinite number of times a series of *X* values would result, each consisting of the same true score but differing due to different error scores. Being random, the expected value of the error scores is zero, $E(e_i) = 0$. *Assumption 2:* given assumption 1, the true score is equal to the expected value of the observed scores over an infinite number of repeated measurements (made under similar conditions), $T_i = E(X_i)$. *Assumption 3:* observed differences among individuals may be due to differences in their true scores or due to differences in their error scores. This implies that the variance of observed scores is a composite of the variance of true scores and the variance of error scores:
$$
\sigma^2_X = \sigma^2_T + \sigma^2_e \tag{2} 
$$
and a little algebra shows that
$$
\frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_e}{\sigma^2_X} \tag{3}
$$

This last piece, the extent to which a set of measurements is free from random error variance is **reliability**. As a proportion, reliability can range from 0 to 1; it equals 1 when all the observed variance in a set of measurements is due only to true-score variance, that is, when there are no random errors of measurement, and it equals 0 when all the observed variance is due to random error variance. This definition implies that some measurement errors can be random while others are systematic. When measurement error is systematic, it is referred to as bias. In the remainder of this lecture we focus on random measurement
error.


### 1.2 Covariance and Correlation
Variance is a summary statistic that indicates the variation in a set of scores from their mean (and standard deviation is the square root of variance). Covariance indicates the covariation of two sets of scores from their respective means. Covariance, and correlation have been, and continue to be extremely useful in measurement theory. To briefly illustrate covariance assume that we have measured three people on two variables *x* and *y*; their *x* scores are: 2, 4, and 6, and their *y* scores are: 7, 5, and 9. The means are readily calculated as 4 for *x* and 7 for *y*. If we express each person’s scores as a deviation from the respective means we get -2, 0, 2 for *x* and 0, -2, 2 for *y*, respectively. If we multiply the pairs of deviation scores, total them up (-2)(0) + (0)(-2) + (2)(2) = 4, and divide by the number of people contributing scores, 4/3, we obtain covariance, 1.33. Covariance can be either positive or negative, can range from negative to positive infinity, and its magnitude is dependent on the metric(s) of the two variables involved. Therefore it is difficult to know whether 1.33 represents a lot or a little covariation. 

| A more convenient metric for expressing covariation is the correlation coefficient $r$. The correlation coefficient has the advantage that it is standardized; it is equal to 1.0 when there is perfect positive covariation between the two sets of scores, -1.0 when there is perfect negative covariation between the two sets of scores, and 0.0 when there is no covariation. If a covariance is divided by the product of the standard deviations for *x* and *y*, we obtain $r(x, y)$, the correlation between *x* and *y*. In the current example both standard deviations are 1.63, so $r(x, y) = 1.33/(1.63 \times 1.63) = .50$. For this reason, we often refer to a correlation as a *standardized covariance*. Another useful property of the correlation coefficient is that its square represents the proportion of variance shared between *x* and *y*. In this example $r(x, y)^2 = .25$, so we may state that 25% of the variance in *x* is shared with *y*, and 25% of the variance in *y* is shared with *x*. 
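
| The arithmetic in this section can be checked with a few lines of R. This is only a sketch of the worked example; the object names are my own, and the denominators use the number of people ($n$) as in the text, whereas R's built-in `cov()` and `sd()` divide by $n - 1$.

```{r, echo=T}
# scores for three people on x and y (from the example above)
x <- c(2, 4, 6)
y <- c(7, 5, 9)
n <- length(x)
# deviation scores from the respective means
dx <- x - mean(x)
dy <- y - mean(y)
# covariance and SDs with n (not n - 1) in the denominator, as in the text
cov_xy <- sum(dx * dy) / n
sd_x <- sqrt(sum(dx^2) / n)
sd_y <- sqrt(sum(dy^2) / n)
# correlation = covariance standardized by the two SDs
r_xy <- cov_xy / (sd_x * sd_y)
round(c(cov = cov_xy, sd_x = sd_x, sd_y = sd_y, r = r_xy, r2 = r_xy^2), 2)
```

| Because the choice between $n$ and $n - 1$ cancels in the ratio, `cor(x, y)` returns the same correlation even though `cov(x, y)` returns a different covariance.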

### 1.3 Reliability Defined
| According to CTT, reliability is the proportion of variance in a set of measurements associated with true-score variance. In other words, if we could correlate true scores with observed scores and square the result, $r(T_i, X_i)^2$, this would indicate the proportion of variance in the observed scores that is shared with the true scores. We can now write a formal definition: 
$$
\text{reliability} = r(T_i, X_i)^2 = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_e}{\sigma^2_X} \tag{4}
$$


Recall that true scores are never known, and so we must derive a means of estimating $r(T_i, X_i)^2$ from observable scores.

| Measurement theorists (psychometricians) have studied random measurement error for a long time. In two seminal papers published in the same issue of the *British Journal of Psychology* in 1910, Charles Spearman and William Brown showed that when multiple measurements of a construct (e.g., intelligence, anxiety, emotional well-being) are obtained from the same individual and combined, the ratio of true-score variance to observed-score variance in the composite score increases. In other words, one way to reduce random measurement error is to create instruments that contain multiple measurements and to aggregate these measurements when defining the observed score for an individual. Spearman and Brown also showed that as the number of similar measurements being combined increases, so does the reliability of the composite scores. In essence, this is why most psychological scales today are formed by aggregating responses to multiple items.
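
| The gain from aggregation can be illustrated with the Spearman-Brown formula, which gives the reliability of a composite of $k$ parallel measurements, each with reliability $\rho$, as $k\rho / [1 + (k - 1)\rho]$. The sketch below is illustrative only: the function name `spearman_brown` and the single-measurement reliability of .50 are assumptions of mine, not values from the data analyzed later.

```{r, echo=T}
# Spearman-Brown: reliability of a composite of k parallel measurements,
# each with reliability rho (hypothetical helper for illustration)
spearman_brown <- function(rho, k) {
  (k * rho) / (1 + (k - 1) * rho)
}
k_items <- c(1, 2, 3, 5, 10)
sb <- spearman_brown(rho = 0.50, k = k_items)
# reliability rises with k: .500, .667, .750, .833, .909
round(setNames(sb, paste0("k=", k_items)), 3)
```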


### 1.4 Cronbach's Coefficient $\alpha$
One way of estimating reliability from observed scores on a multi-item scale is Cronbach's $\alpha$. Cronbach developed this coefficient in 1951, before CFA existed. If we consider a composite score made up of $k$ items, then $\alpha$ is defined as the ratio of inter-item covariance to composite-score variance weighted by $k/(k - 1)$. The weight, $k/(k - 1)$, corrects for the proportion of variance in any item which is due to the same elements as the covariance. In other words, $\alpha$ is an estimate of the proportion of true-score variance in the multi-item scale scores. As an estimate of a proportion it normally falls between 0.0 and 1.0 (a negative value can occur when the items covary negatively on average), and it will only be equal to 1.0 when all the variance in observed scores is true-score variance. Cronbach’s $\alpha$ is easily calculated from the item covariance matrix $\mathbf C$. 

$$\mathbf C =
 \begin{bmatrix}
  \sigma^2_{1,1} & \sigma_{1,2} & \cdots & \sigma_{1,k} \\
  \sigma_{2,1} & \sigma^2_{2,2} & \cdots & \sigma_{2,k} \\
  \vdots  & \vdots  & \ddots & \vdots  \\
  \sigma_{k,1} & \sigma_{k,2} & \cdots & \sigma^2_{k,k}
 \end{bmatrix}$$
 

The diagonal elements are item variances and the off-diagonal elements are covariances. The ratio of inter-item covariance to composite-score variance is equal to 1 minus the ratio of the sum of the item variances (diagonal elements of $\mathbf C$) to the sum of all the elements in $\mathbf C$. This ratio is calculated first, then subtracted from 1, and the result is finally multiplied by $k/(k - 1)$. 


$$
\alpha =\frac{k}{k - 1} \left[1-\frac{\sum_{i=1}^k \sigma^2_{ii}} {\sum_{i=1}^k \sigma^2_{ii} + 2\sum_{i<j}^k\sigma_{ij}} \right] \tag{5}
$$
Although Cronbach's $\alpha$ is by far the most commonly reported coefficient of reliability, there are others. One of these is $\omega$ which can be calculated from CFA parameter estimates. More on this below.
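
The three steps behind equation 5 can be sketched as a small R function. The function name `cronbach_alpha` and the 3 x 3 matrix `C_demo` (unit variances, all covariances equal to .50) are hypothetical and used only for illustration.

```{r, echo=T}
# alpha from an item covariance matrix C (hypothetical helper)
cronbach_alpha <- function(C) {
  k <- nrow(C)
  item_var <- sum(diag(C))   # sum of the item variances (diagonal)
  total_var <- sum(C)        # variance of the composite (all elements)
  (k / (k - 1)) * (1 - item_var / total_var)
}
# illustrative matrix: three items, variances of 1, covariances of .50
C_demo <- matrix(0.5, nrow = 3, ncol = 3)
diag(C_demo) <- 1
cronbach_alpha(C_demo)   # (3/2) * (1 - 3/6) = 0.75
```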


## 2. The CFA Model
CFA is a type of structural equation model (SEM) that focuses on the relations between observed variables and the latent variables (i.e., factors) hypothesized to underlie them. I will not go into the math here. CFA models can be used to estimate reliability. When working with SEM we often rely on diagrams for presenting our hypothesized model and results. These path diagrams represent observed variables as squares, latent variables as circles, and causal influence with arrows. Figure 1 is a very basic path diagram illustrating CTT; the variance in the observed variable *X* is the sum of unobserved true-score variance and error variance.
```{r, fig.width=5, fig.height=2, fig.align='center', fig.cap='Figure 1. Sources contributing to the variance of *X*. T refers to true-scores, e refers to measurement error.'}
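# LISREL matrices for one observed variable X and one latent variable T
lx <- matrix(1, 1, 1)      # Lambda-X: loading of X on T
td <- diag(1, nrow(lx))    # Theta-Delta: error variance of X
ph <- matrix(1, 1, 1)      # Phi: variance of T
mod0 <- semPlot::lisrelModel(LX = lx, TD = td, PH = ph)
semPlot::semPaths(mod0, as.expression = c("nodes", "edges"), sizeMan = 8,
         sizeLat = 10, style = "lisrel", rotation = 2,
         residScale = 15, edge.color = "black",
         edge.label.cex = 1.5, label.cex = 1.5,
         nodeLabels = c("X", "T"), edgeLabels = c("", "e"))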
```
| 
| Before we consider reliability and CFA models, we need a bit more theory for dealing with several correlations at the same time.

### 2.1 Path Analysis Theory
Path Analysis (PA) is a method for decomposing the correlation coefficients within a system of causally related variables. PA applies multiple linear regression and was developed by zoologist Sewall Wright in 1934. *Path coefficients are standardized regression coefficients obtained from a set of inter-related linear regression models*. PA both allows and requires the investigator to theorize about the causal relations among a set of variables and to apply that thinking to decompose the correlations among the variables.


| To illustrate, consider two variables $x$ and $y$ that are correlated. We may theorize about *why* they are correlated using PA. They could correlate because $x$ causes $y$ $(x \rightarrow y)$, or because $y$ causes $x$ $(y \rightarrow x)$, or because $x$ and $y$ share a common cause, $z$, $(x \leftarrow z \rightarrow y)$. The first two "theories" are referred to as *direct effects*; the third is called a *spurious effect*.

|
| 
PA makes certain assumptions:

| 1. The relations among the variables are linear, additive, and causal.

| 2. The residual terms are uncorrelated. That is, each residual is not correlated with the variables that precede it in the model. 

| 3. There is a one-way causal flow in the system. Reciprocal causation between variables (i.e., feedback loops) is ruled out.

| 4. The variables are measured on an interval-level scale.

| 5. The variables are measured without random error.

These assumptions provide part of the framework for applying structural equation models. We should keep them in mind.


### 2.2 CTT & CFA
CFA estimates the correlations among observed variables in terms of underlying latent variables. This is convenient because we can now represent our CTT-based measurement model in mathematical terms (as a CFA model) and test it on data. Consider a generic case. We want to measure individual differences in some construct (anxiety, for example). We know CTT, and so we create multiple (here, three) statements (survey items) that on the face of it capture the construct of interest. Several individuals then read the statements and indicate how much they agree or disagree with each. The data can be summarized in a 3x3 covariance matrix.

| CTT leads us to hypothesize that the variances of the three items are influenced by one underlying latent variable representing our construct. We can map this hypothesis into a CFA model where the three items $x_1$, $x_2$, and $x_3$ are indicators of one underlying latent variable, $\xi_1$. Our model is shown as a path diagram in Figure 2. 
| 
| 

Specifying a generic CFA model using LISREL notation:
```{r model, echo=T}
# one factor with three indicators
# Lambda-X matrix (3 indicators x 1 factor), all loadings set to 1:
lambdax <- matrix(0, 3, 1)
lambdax[1:3, 1] <- 1
# Phi matrix (factor variance):
LatX <- matrix(1, 1, 1)
# Theta-Delta matrix (error variances):
thd <- diag(1, nrow(lambdax))
# Combine matrices into a model:
mod1 <- semPlot::lisrelModel(LX = lambdax, TD = thd, PH = LatX)
```

Plotting CFA path diagram:
```{r pathdiagram, echo=T, fig.width=4,fig.height=4, fig.align='center', fig.cap="Figure 2. The LISREL model for a one factor model with three indicators."}
# Plot the path diagram with semPlot::semPaths

semPlot::semPaths(mod1, as.expression = c("nodes", "edges"), sizeMan = 9,
         sizeLat = 12, style = "lisrel", rotation = 3,
         residScale = 14, edge.color = "black",
         edge.label.cex = 1.5, label.cex = 1.5)
```
| 

| Figure 2 is the graphical representation of our measurement model using LISREL notation. Squares denote observed variables (indicators) and the circle represents the latent variable or construct of interest. Each indicator has a coefficient $\lambda_{i1}$ (lambda) linking it to the latent variable $\xi_1$ representing our construct and a coefficient $\theta_{ii}$ (theta) expressing its measurement error variance. PA tells us that two variables may be correlated because they share a common cause. For example, because $x_1$ and $x_2$ share a common cause $\xi_1$, our model implies (in standardized form) that $r(x_1,x_2) = \lambda_{11}\lambda_{21}$. The same can be said about the other two correlations, $r(x_1,x_3) = \lambda_{11}\lambda_{31}$ and $r(x_2,x_3) = \lambda_{21}\lambda_{31}$. If the estimated correlations from our CFA model are consistent with the observed correlations, there is support for the hypothesis that the three items are measuring the same underlying construct.
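
| The product rule described here can be sketched in matrix form: with the factor variance fixed to 1, the model-implied correlation matrix is $\Lambda\Phi\Lambda' + \Theta$. The standardized loadings of .8, .7, and .6 below are hypothetical values chosen for illustration, and the object names are my own.

```{r, echo=T}
# hypothetical standardized loadings for x1, x2, x3 on one factor
lam_demo <- matrix(c(0.8, 0.7, 0.6), ncol = 1)
phi_demo <- matrix(1, 1, 1)                      # factor variance fixed to 1
theta_demo <- diag(1 - as.vector(lam_demo)^2)    # error variances
# model-implied correlation matrix: Lambda Phi Lambda' + Theta
Sigma_demo <- lam_demo %*% phi_demo %*% t(lam_demo) + theta_demo
dimnames(Sigma_demo) <- list(paste0("x", 1:3), paste0("x", 1:3))
# off-diagonals are products of loadings, e.g. r(x1, x2) = .8 * .7 = .56
round(Sigma_demo, 2)
```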

### 2.3 Reliability Coefficient $\omega$
| One way to interpret the $\lambda_{ij}$ coefficients is to think of them as correlations between the indicator variables and the latent variable. Squaring them tells us the proportion of variance shared between the item and the latent construct it is hypothesized to measure. In the CTT framework this is the proportion of true-score variance in the observed variable, so larger values (approaching 1.0) are ideal. We can use the parameters from our CFA model to estimate the reliability of our multi-item scale using another index $\omega$ (omega) which is defined as:
$$
\omega = \frac{\left(\sum^k_{i=1}\lambda_i\right)^2} 
{\left(\sum^k_{i=1}\lambda_i\right)^2 + \sum^k_{i=1}\theta_{ii}}   \tag{6}
$$
The numerator represents true-score variance and the denominator represents observed variance. Equation 6 can be applied to standardized coefficients (as is done here) and to unstandardized coefficients.
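
As a minimal sketch, $\omega$ can be computed from CFA estimates by taking the squared sum of the loadings as the true-score variance of the composite and dividing it by that quantity plus the summed error variances. The function name `omega_from_cfa` and the standardized loadings of .8, .7, and .6 are hypothetical and used only for illustration.

```{r, echo=T}
# omega from loadings (lambda) and error variances (theta); hypothetical helper
omega_from_cfa <- function(lambda, theta) {
  true_var <- sum(lambda)^2          # true-score variance of the composite
  true_var / (true_var + sum(theta)) # divided by total composite variance
}
lam_hyp <- c(0.8, 0.7, 0.6)    # illustrative standardized loadings
theta_hyp <- 1 - lam_hyp^2     # standardized error variances
omega_from_cfa(lam_hyp, theta_hyp)   # 4.41 / (4.41 + 1.51)
```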


### 2.4 CFA using lavaan
Specifying models in `lavaan` is pretty straightforward. There are different operators for specifying various relations among variables. Latent variables are defined using the `=~` operator with the latent variable on the left and the indicator variables on the right. In the example below I show the R code for specifying and running a very simple CFA model in `lavaan`. One nice feature is that we don't always need the raw data file; we can work from the covariance matrix, or from the correlation matrix plus a vector of standard deviations, together with the sample size. 


## 3. An Example
This example involves three items hypothesized to measure anxiety related to COVID-19:

| *ax1:* I am afraid of COVID-19.
| *ax2:* Thinking about COVID-19 makes me feel threatened.
| *ax3:* I am stressed around people because I worry I'll catch COVID-19.
| 
| The response options were: 1 'Strongly Disagree', 2 'Disagree', 3 'Neither Agree nor Disagree',  4 'Agree', and 5 'Strongly Agree'. These items were included in a survey of faculty, staff, and students (*N* = 38,921) at a major university in the Southeast United States during the first few weeks of the pandemic in 2020.

| 
| Reading in the data:
```{r, echo=T}
# reading in corr matrix
corr <- '
1.000
.728 1.000
.714 .713 1.000 '
# add variable names and convert to full correlation matrix
corrfull <- lavaan::getCov(corr, names=c("ax1","ax2","ax3"))
# add SDs and convert to full covariance matrix
covfull <- lavaan::cor2cov(corrfull, sds=c(1.163, 1.127, 1.169))
# observed covariance matrix N=38921
# vector of means that may be needed (scores range from 1 to 5)
Mns<- c(3.26, 2.76, 3.00)
```

### 3.1 Correlation Table
Using `sjPlot::tab_corr` to create this nice table.
```{r}
# Uses sjPlot to print a nice looking correlation table
sjPlot::tab_corr(corrfull, na.deletion = "listwise", digits = 3, triangle = "lower",
         title = "Correlations among observed variables (N = 38,921)",
         string.diag=c('1.000','1.000','1.000'))
```
| *ax1:* I am afraid of COVID-19.
| *ax2:* Thinking about COVID-19 makes me feel threatened.
| *ax3:* I am stressed around people because I worry I'll catch COVID-19.
|
|
| CTT posits that because these items are measurements of the same construct, responses to them should be correlated. Notice that the correlations range from .713 to .728. Consider the correlation between the first two items, $r(ax1, ax2) = .728$. If we square this correlation we get the proportion of variance that these two items share, $r^2(ax1, ax2) = .728^2 = .530$.

### 3.2 CFA of COVID Anxiety Scale
|
Specify the model:
```{r, echo=T}
# specify the CFA model
model1 <- ' COVID_Axty =~ ax1 + ax2 + ax3 '
```

Running the model:
```{r, echo=T}
# fit the CFA model (std.lv=T fixes the latent variance to 1)
fit <- lavaan::cfa(model1, sample.cov=covfull, sample.nobs=38921, std.lv=T)
```

### 3.3 Results

Plotting results as path diagram:
```{r, fig.width=4,fig.height=4, fig.cap="Figure 3. COVID Anxiety Scale measurement model", echo=T, fig.align='center'}
# making path CFA diagram
fig3 <- semPlot::semPaths(fit, whatLabels = "std", layout = "tree2", 
         rotation = 1, style = "lisrel", optimizeLatRes = TRUE, nDigits=3,
         structural = FALSE, layoutSplit = FALSE,
         intercepts = FALSE, residuals = T, 
         curve = 3, curvature = 3, nCharNodes = 8, 
         sizeLat = 24, sizeLat2 = 12, sizeMan = 9,
         edge.label.cex = 1.5, label.cex=1.0, residScale=14, 
         edge.color = "black", edge.label.position = .40, DoNotPlot=T)
#--working with semptools to modify path diagrams
fig3_1 <- fig3  |> semptools::mark_sig(fit, alpha = c("(n.s.)" = 1.00, "*" = .05)) #adding *, ns
plot(fig3_1)
```
| *ax1:* I am afraid of COVID-19.
| *ax2:* Thinking about COVID-19 makes me feel threatened.
| *ax3:* I am stressed around people because I worry I'll catch COVID-19.
|
|
| Let's examine the parameter estimates from our model shown in Figure 3. The $\lambda$ for *ax1* is .854 and the $\lambda$ for *ax2* is .853. These are the estimated correlations of these items with the underlying construct, COVID-Anxiety. If we think about the notion of spurious effects from PA, we can say that the reason responses to these two items are so highly correlated is that they are both strongly influenced by the same underlying construct, COVID-Anxiety. If we multiply these two $\lambda$s together we get .728, which is the observed correlation $r(ax1,ax2)$ shown in the correlation table above. The same can be done to reproduce the other two correlations. 

| 
| The three $\theta$s at the bottom of Figure 3 show the estimated proportion of measurement error variance in each item. If considered separately, rather than as a composite, roughly 28% of the variance in each item is measurement error. Squaring the $\lambda$s gives the proportion of true-score variance being "captured" by each item. Note that because all coefficients are standardized we have $\lambda^2 + \theta = 1.0$ for each item.
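
| Because a one-factor model with three indicators is just-identified, the standardized loadings can also be recovered by hand from the three observed correlations by solving $r_{12} = \lambda_1\lambda_2$, $r_{13} = \lambda_1\lambda_3$, and $r_{23} = \lambda_2\lambda_3$. The sketch below is an illustration of that algebra under this assumption, not a description of how `lavaan` estimates the model; the object names are my own.

```{r, echo=T}
# observed correlations among ax1, ax2, ax3 (from the table above)
r12 <- 0.728; r13 <- 0.714; r23 <- 0.713
# closed-form standardized loadings for the just-identified one-factor model
lam_hand <- c(ax1 = sqrt(r12 * r13 / r23),
              ax2 = sqrt(r12 * r23 / r13),
              ax3 = sqrt(r13 * r23 / r12))
# standardized error variances: theta = 1 - lambda^2
theta_hand <- 1 - lam_hand^2
round(rbind(lambda = lam_hand, theta = theta_hand), 3)
# the product of two loadings returns the observed correlation r(ax1, ax2)
unname(lam_hand["ax1"] * lam_hand["ax2"])
```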


```{css, echo=FALSE}
.scroll-500 {
  max-height: 500px;
  overflow-y: auto;
  background-color: inherit;
}
```
|
|
| 
Full output:

| I used the `summary()` function to write `lavaan` results to the console. There is a lot of output to review. The first bit is model fit statistics. Because this model is just-identified (there are as many parameters as there are data elements in the covariance matrix) the degrees of freedom are 0 and the model fit is perfect. Next we see **Parameter Estimates** beginning with the Latent Variables section that shows the factor loadings $\lambda_{ij}$. The column labeled *Estimate* gives the unstandardized coefficients, and the column *Std.all* gives the completely standardized coefficients that are shown in Figure 3. Next come the Variances. Variables prefixed with a "." indicate residual terms so these are the $\theta_{ii}$ estimates. Finally, we see the $R^2$ values indicating the proportion of variance that each item shares with the latent variable. 

```{r, class.output="scroll-500"}
summary(fit, fit.measures=T, standardized=T, rsquare=T)
```
|
| 
### 3.4 Reliability Analysis
The `semTools` package has a function `compRelSEM` that reads the `lavaan` output and returns the reliability estimates for each latent variable. Setting the options `tau.eq=T` and `obs.var=T` produces a coefficient equivalent to Cronbach's reliability coefficient $\alpha$ (alpha). Setting `tau.eq=F` produces the $\omega$ (omega) coefficient of reliability. In this example the two are quite similar.
```{r, results='hold'}
relalpha<-semTools::compRelSEM(fit, tau.eq=T, obs.var=T)
```
| The Cronbach's $\alpha$ reliability for the COVID-Anxiety Scale (using eq. 5) is `r round(relalpha[1], 3)`.

```{r, results='hold'}
relomega<-semTools::compRelSEM(fit, tau.eq=F, obs.var=T)
```
| The $\omega$ reliability coefficient for the COVID-Anxiety Scale (using eq. 6) is `r round(relomega[1], 3)`.
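
| As a rough cross-check on these two values, both coefficients can be computed by hand from the correlations and standard deviations read in earlier. This is a sketch under the assumption that the one-factor model is just-identified, so the loadings have a closed-form solution; it does not reproduce the internals of `compRelSEM`, and the object names are my own.

```{r, echo=T}
# observed correlations and SDs for ax1, ax2, ax3 (values from Section 3)
R_obs <- matrix(c(1.000, 0.728, 0.714,
                  0.728, 1.000, 0.713,
                  0.714, 0.713, 1.000), nrow = 3, byrow = TRUE)
sd_obs <- c(1.163, 1.127, 1.169)
C_obs <- diag(sd_obs) %*% R_obs %*% diag(sd_obs)   # covariance matrix
# alpha: item variances relative to composite variance, weighted by k/(k-1)
k_obs <- nrow(C_obs)
alpha_hand <- (k_obs / (k_obs - 1)) * (1 - sum(diag(C_obs)) / sum(C_obs))
# closed-form standardized loadings, rescaled to the raw metric
lam_std <- sqrt(c(R_obs[1, 2] * R_obs[1, 3] / R_obs[2, 3],
                  R_obs[1, 2] * R_obs[2, 3] / R_obs[1, 3],
                  R_obs[1, 3] * R_obs[2, 3] / R_obs[1, 2]))
lam_raw <- lam_std * sd_obs
theta_raw <- sd_obs^2 * (1 - lam_std^2)
# omega: squared sum of loadings over total composite variance
omega_hand <- sum(lam_raw)^2 / (sum(lam_raw)^2 + sum(theta_raw))
round(c(alpha = alpha_hand, omega = omega_hand), 3)
```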

## 4. Concluding Remarks

Reliability of measurements is the extent to which they are free from random measurement error. The reliability of a set of measurements limits the degree to which they can correlate with measurements on other variables.
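
The limit mentioned here follows from the CTT attenuation result: the correlation between two sets of observed scores equals the correlation between their true scores multiplied by the square root of the product of the two reliabilities. The sketch below illustrates this; the function name `attenuated_r` and the second reliability of .70 are hypothetical.

```{r, echo=T}
# observed correlation implied by a true-score correlation and two reliabilities
attenuated_r <- function(r_true, rel_x, rel_y) {
  r_true * sqrt(rel_x * rel_y)
}
# even perfectly correlated true scores cannot produce an observed
# correlation above sqrt(rel_x * rel_y)
r_ceiling <- attenuated_r(r_true = 1.0, rel_x = 0.88, rel_y = 0.70)
round(r_ceiling, 3)
```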

|

| In this lecture we covered the basics of psychometric measurement theory and mapped key concepts onto the CFA model. We examined the reliability of data obtained using a three-item scale intended to measure anxiety related to COVID-19. CFA provided us with some psychometric information about these data. The correlations among the items are consistent with a single underlying latent variable representing our construct. Each item had about 28% measurement error variance. When treated as a composite scale the amount of measurement error variance decreased to about 12% because the true-score variance was estimated to be 88% (both $\alpha$ and $\omega$ were $\approx .88$).

|
| Reliability is a property of data, not of the instruments used to obtain them; this means that it is more important to calculate and report the reliability of your own data than it is to cite previously published values of reliability.


## 5. References

| Beckstead, J. W. (2013). On measurements and their quality: Paper 1. Reliability - history, issues and procedures. *International Journal of Nursing Studies, 50* (7), 968–973.

| 
| Brown, W. (1910). Some experimental results in the correlation of mental abilities. *British Journal of Psychology, 3*, 296–322.

| 
| Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. *Psychometrika, 16* (3), 297–334.

|
| Spearman, C. (1904). General intelligence: objectively determined and measured. *American Journal of Psychology, 15*, 201–293.

| 
| Spearman, C. (1910). Correlation calculated with faulty data. *British Journal of Psychology, 3*, 271–295.
 
|
| Wright, S. (1934). The method of path coefficients. *Annals of Mathematical Statistics, 5*, 161–215.
