Reliability of pre, post measures

Author

Julius

Notes

## global variables: 
# only if needed

Theory

brief theory of latent variable models

In the context of answering survey questions, the Cognitive Aspects of Survey Methodology framework (also called the „Optimizing-Satisficing-Model“) tries to explain how people arrive at a response, which makes it a useful heuristic for identifying possible sources of error (Tourangeau, Rips, and Rasinski 2000; Moosbrugger and Kelava 2020):

Cognitive Aspects of Survey Methodology

Despite not knowing the exact processes of answering a question (black box), we fundamentally assume that the answer / reporting depends on a score on a latent variable:

Epicycles of Analysis by Peng and Matsui (2016)

the central task of test theory is to determine the relationship between test behaviour and the (psychological) characteristic to be assessed

Latent variable definition: random variables whose realized values are hidden (Bollen 2002; Borsboom 2008)

A possible operationalisation of a latent variable is a linear measurement model with the equation Y = \lambda * \eta + \epsilon.

basic measurement model

this corresponds to the fundamental equation of Classical Test Theory: Y = T + \epsilon
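To make the correspondence explicit, here is a short derivation. It assumes that \eta and \epsilon are uncorrelated, which is the usual assumption but is not stated above; Var and Rel are notation introduced here for illustration.

```latex
\begin{aligned}
T &= \lambda \eta \\
Var(Y) &= \lambda^2 Var(\eta) + Var(\epsilon) \\
Rel(Y) &= \frac{Var(T)}{Var(Y)} = \frac{\lambda^2 Var(\eta)}{\lambda^2 Var(\eta) + Var(\epsilon)}
\end{aligned}
```

Under this reading, reliability is the share of the observed variance that is due to the latent variable.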

\rightarrow models which deal with latent variables are called Latent Variable Models (Skrondal and Rabe-Hesketh 2004, 2007); their central aim is to infer unobservable (latent) psychological traits or abilities from responses to test items

Latent variable models Skrondal and Rabe-Hesketh (2007)

theoretical aspects “What is a Latent Variable?”

  • the theoretical status of latent variables has not been clarified (Schurig 2017)
    • are these variables representations of real entities or just useful inventions?
    • Advantage: the use of latent variables allows more generalisable reasoning than manifest variables
  • latent variables need a substantive scientific foundation, whereby the bridging problem between observed and latent variable must be solved by theoretical assumptions and statistical modelling

Central to latent variable models is the local independence assumption: once the latent variable (e.g., a psychological trait or ability) is accounted for, the observed variables (such as responses to test items) are statistically independent of each other. This means that any correlation between the observed variables is fully explained by the latent variable, and no further direct relationships exist between the observed variables. In essence, the latent variable “absorbs” the shared variance, allowing the model to “get rid” of any inter-dependencies among the observed variables, simplifying the analysis (Skrondal and Rabe-Hesketh 2007).

Pr(y_j \mid \eta_j) = \prod_{i=1}^{n} Pr(y_{ij} \mid \eta_j)
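A minimal simulation sketch of local independence. It uses a linear model with continuous items rather than the categorical case in the formula, and all loadings, error SDs and the sample size are hypothetical values chosen for illustration. Two items correlate marginally, but the correlation vanishes once the latent variable is partialled out.

```r
set.seed(1)
n   <- 5000
eta <- rnorm(n)                          # latent variable (hypothetical)
y1  <- 0.8 * eta + rnorm(n, sd = 0.6)    # item 1 = loading * eta + error
y2  <- 0.8 * eta + rnorm(n, sd = 0.6)    # item 2, with independent error

# Marginal correlation: clearly positive, because both items load on eta
r_marginal <- cor(y1, y2)

# Partial correlation: correlate what is left after removing eta from both items
r_partial <- cor(resid(lm(y1 ~ eta)), resid(lm(y2 ~ eta)))

round(c(marginal = r_marginal, partial = r_partial), 2)
```

In real data \eta is not observed, so this check is only possible in a simulation; in practice, local dependence is typically diagnosed from model residuals.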

This assumption does not hold in more complex data sets (e.g., multidimensionality, response styles, …). In more complex models (e.g., the bifactor model), the variance of the indicators of a single measurement model (CFA) is divided into common and unique variance:

Variance splitting in CFAs

The variance shared between the indicators is the communality; the remaining variance is the unique variance, which is divided into indicator-specific method variance (specific) and measurement error variance (error).


quality criteria of measurements

measurement and test administration should take into account three central quality criteria of tests, which build on each other (e.g., no reliable measurement is possible without objectivity):

\text{objectivity} \rightarrow \text{reliability} \rightarrow \text{validity}

  • Objectivity: Test score is objective if it is independent of any influences outside the tested person (e.g., situational conditions, experimenter – all exogenous variables whose covariance structure is not explained by the statistical model, including error terms, unobserved influencing variables, and exogenous latent constructs)
    • Implementation objectivity (Durchführungsobjektivität): Standardization of implementation conditions (writing a test manual, training test leader, standardization of all other conditions).
    • Objectivity of evaluation (Auswertungsobjektivität): The scoring of the test does not depend on the person who evaluates the test (measurable by inter-rater reliability, such as Kendall’s coefficient of concordance).
    • Objectivity of interpretation (Interpretationsobjektivität): Different test users come to the same conclusions with identical test scores.
  • Reliability: The precision with which a test measures a characteristic, regardless of whether this is the characteristic the test is intended to measure. The focus here is on measurement accuracy. In theory, a reliable test produces the same results when the measurement is repeated under the same conditions (the central contribution to the development of reliability measurement is made by classical test theory, which establishes a theory of measurement error).
    • Reliability can be estimated by different methods, often as a measure of internal consistency - Cronbach’s Alpha (a measure of how items in a scale correlate with one another) is commonly used.

Classical test theory assumes that a person’s observed response to question i is composed of x_{i} = \tau_{i} + \epsilon_{i}. Here \tau_{i} is the person’s true score on question i and \epsilon_{i} is the measurement error, which is assumed to be unsystematic (if there are systematic components in the errors, apply models which can do variance splitting)
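A minimal sketch of this decomposition, assuming normally distributed true scores and errors with hypothetical SDs of 1 and 0.5. Under these assumptions the reliability Var(\tau) / Var(x) is 0.8, and the correlation between two repeated measurements under the same conditions should recover it approximately.

```r
set.seed(2)
n   <- 10000
tau <- rnorm(n, mean = 3, sd = 1)     # true scores (hypothetical)
x1  <- tau + rnorm(n, sd = 0.5)       # first measurement: true score + error
x2  <- tau + rnorm(n, sd = 0.5)       # repeated measurement, independent error

rel_theory <- 1^2 / (1^2 + 0.5^2)     # Var(tau) / Var(x) = 0.8
rel_retest <- cor(x1, x2)             # retest correlation as reliability estimate

c(theory = rel_theory, retest = rel_retest)
```

The retest correlation only estimates reliability here because the true scores do not change between the two measurements, which is part of the simulation setup.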

  • Validity: A test is considered valid if it actually measures the characteristic it is supposed to measure and not some other characteristic. The measurement of validity is done in two steps:
    • Via structure-searching (such as exploratory factor analysis) and structure-testing (such as confirmatory factor analysis) procedures, construct validity is determined. This indicates the extent to which conclusions can be drawn from test results, for example, about psychological personality traits.
    • The agreement of the results of the individual constructs should be high with constructs that measure the same or similar characteristics (convergent validity), and the agreement with results from constructs that measure other characteristics should be low (discriminant validity). This can be analyzed via (latent) correlations, see also construct validity (Cronbach and Meehl 1955)

More recently, an argument-based approach to validation has also become important. To validate an interpretation or use of measurements is to evaluate the rationale, or argument, for the proposed conclusions and decisions … Ultimately, the need for validation derives from the scientific and social requirement that public claims and decisions need to be justified:

  • interpretive argument: specifies the proposed interpretations and applications of assessment results by laying out a network of inferences and assumptions leading from the observed performances to the conclusions and decisions based on the assessment scores
  • validity argument: provides an evaluation of the interpretive argument’s coherence and the plausibility of its inferences and assumptions

A variety of further quality criteria of indicators were developed by the „Key National Indicators Initiative“ (outdated, ~ 2005):

Key National Indicators Initiative, by Groves and Lyberg (2010)

To reflect on all possible factors influencing the quality of a survey, there is also the concept of the Total Survey Error (see Biemer et al. 2017; Groves and Lyberg 2010).

assumptions of the 1- and 2-parameter logistic models (also called the Rasch and Birnbaum models)

Before running a statistical model, its assumptions are normally tested, at least those to whose violation the model is sensitive. For example, the t-test is fairly robust to violations of normality, so pragmatically it is often not necessary to test this assumption.

For an important “Item Response Theory” (IRT) model, the Rasch model (1PL), we have the following assumptions (from my master thesis):

  • Unidimensionality: The probability of a correct response depends only on a single latent trait and is determined by the model parameters \theta_v, \beta_i. Besides the model parameters, there are no other influencing variables \varphi: P(X_{vi} = 1 \mid \theta_v, \beta_i, \varphi) = P(X_{vi} = 1 \mid \theta_v, \beta_i)
  • Local stochastic independence: When the person ability \theta_v is held constant at a specific value, the correlation between any possible item pair X_{vi}, X_{vj} in the test disappears (where i \neq j): X_{vi} \perp X_{vj} \mid \theta_v, \quad \forall \, i \neq j
  • Sufficiency of sum scores: The sum scores R_v = \sum_{i=1}^{k} X_{vi} of a test with length k are sufficient for estimating a person’s ability \theta_v. The same applies analogously to the item scores C_i = \sum_{v=1}^{n} X_{vi}
  • Monotonicity: The probability of a correct response to an item x_{vi} increases monotonically with higher values of person ability \theta. The more able a person is, the more likely they are to answer an item correctly. This is expressed in the ICC f(x_{vi} \mid \theta_v, \beta_i) as follows: \theta_v > \theta_w: f(x_{vi} \mid \theta_v, \beta_i) > f(x_{wi} \mid \theta_w, \beta_i), \quad \forall \, \theta_v, \theta_w

In the 2PL model, each item additionally has its own discrimination parameter \alpha_i, allowing some items to be better at differentiating between individuals with slightly different abilities.
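The item characteristic curves of both models can be sketched with the standard logistic form P(X_{vi} = 1) = exp(\alpha_i (\theta_v - \beta_i)) / (1 + exp(\alpha_i (\theta_v - \beta_i))), where \alpha_i = 1 gives the Rasch model. The function name icc and the parameter values below are chosen for illustration only.

```r
# Item characteristic curve; alpha = 1 gives the 1PL (Rasch) model
icc <- function(theta, beta, alpha = 1) {
  plogis(alpha * (theta - beta))   # logistic function of alpha * (theta - beta)
}

theta <- seq(-3, 3, by = 0.5)               # grid of person abilities
p_1pl <- icc(theta, beta = 0)               # Rasch item with difficulty 0
p_2pl <- icc(theta, beta = 0, alpha = 2)    # same difficulty, higher discrimination

# Monotonicity: the probability increases with ability in both models
all(diff(p_1pl) > 0)
all(diff(p_2pl) > 0)

# If ability equals difficulty, the probability of a correct response is 0.5
icc(0, beta = 0)
```

With a negative alpha the curve would decrease in \theta, which is the monotonicity violation discussed in this section.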


If the 1PL or 2PL model does not fit, this can have several reasons:

  • Multidimensionality: items are influenced by more than one latent trait.
  • Poorly fitting items or data: some items behave in an unexpected or erratic manner and may not show the expected increase in the probability of a correct response as ability increases, especially if the item discrimination parameter \alpha_i is estimated to be negative (monotonicity violated).
  • Ceiling or floor effects: if the test contains items that are either too easy or too difficult for the population being measured, the probability of a correct response might not vary meaningfully with ability over a range of \theta values.

To solve these problems, normally:

  1. the assumptions of the models are tested (e.g., Exploratory Factor Analysis to test for unidimensionality)
  2. if necessary, the data are modelled with more complex statistical models (e.g., Multidimensional Rasch Models)

in the context of survey data, these are normally models from the family of “Latent Variable Models” (see above)
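As a rough first screening for unidimensionality (not a full exploratory factor analysis), one can inspect the eigenvalues of the item correlation matrix. The sketch below simulates six continuous items driven by two correlated traits. The loadings, sample size and helper make_items are hypothetical, and counting eigenvalues above 1 (Kaiser criterion) is only a crude heuristic.

```r
set.seed(3)
n  <- 1000
f1 <- rnorm(n)                                  # first latent trait
f2 <- 0.3 * f1 + sqrt(1 - 0.3^2) * rnorm(n)     # second trait, correlated .3 with f1

# Three items per trait, each = 0.8 * trait + independent error
make_items <- function(f) {
  sapply(1:3, function(i) 0.8 * f + rnorm(length(f), sd = 0.6))
}
x <- cbind(make_items(f1), make_items(f2))      # n x 6 item matrix

ev <- eigen(cor(x))$values                      # eigenvalues, largest first
round(ev, 2)
sum(ev > 1)                                     # number of eigenvalues above 1
```

More than one large eigenvalue suggests that a single latent trait does not account for the item correlations; in this setup two eigenvalues should stand out clearly.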

Test reliability of pre, post measures

Correction for Attenuation theory

see YouTube video: https://www.youtube.com/watch?v=jJ-qLImQYZs

attenuation formula developed by Charles Spearman, based on “Classical Test Theory”

Corr(\tau_1, \tau_2) = \frac{Corr(Y_1, Y_2)}{\sqrt{Rel(Y_1) \cdot Rel(Y_2)}}
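A short sketch of where the formula comes from, under the classical test theory assumptions that Y_k = \tau_k + \epsilon_k and that the errors \epsilon_1, \epsilon_2 are uncorrelated with each other and with the true scores (Cov, Var and SD are notation introduced here for illustration):

```latex
\begin{aligned}
Cov(Y_1, Y_2) &= Cov(\tau_1 + \epsilon_1, \tau_2 + \epsilon_2) = Cov(\tau_1, \tau_2) \\
Corr(Y_1, Y_2) &= \frac{Cov(\tau_1, \tau_2)}{SD(Y_1) \, SD(Y_2)}
  = Corr(\tau_1, \tau_2) \cdot \frac{SD(\tau_1)}{SD(Y_1)} \cdot \frac{SD(\tau_2)}{SD(Y_2)} \\
&= Corr(\tau_1, \tau_2) \cdot \sqrt{Rel(Y_1) \cdot Rel(Y_2)},
  \quad \text{since } Rel(Y_k) = \frac{Var(\tau_k)}{Var(Y_k)}
\end{aligned}
```

Dividing both sides by the square root term gives the correction formula.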

Rule of thumb: if the reliability of the measures Y_1, Y_2 is low, the observed correlation shrinks, which leads to an underestimation of the true correlation

the lower the reliabilities in the denominator, the larger the correction

fairly reliable items:

r12 <- .174 # observed correlation (numerator)
re1 <- .81 # reliability of measure 1 (denominator)
re2 <- .88 # reliability of measure 2 (denominator)

r12 / sqrt(re1 * re2) # plug into the formula
[1] 0.206094

non-reliable items:

r12 <- .174 # observed correlation (numerator)
re1 <- .65 # reliability of measure 1 (denominator)
re2 <- .59 # reliability of measure 2 (denominator)

r12 / sqrt(re1 * re2) # plug into the formula
[1] 0.2809743

only if the items are perfectly reliable is there no attenuation:

Note: this is not a plausible assumption, see classical test theory, which is basically a measurement theory

r12 <- .174 # observed correlation (numerator)
re1 <- 1 # reliability of measure 1 (denominator)
re2 <- 1 # reliability of measure 2 (denominator)

r12 / sqrt(re1 * re2) # plug into the formula
[1] 0.174
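The three calculations above can be wrapped in a small helper. The function name disattenuate and its argument names are made up for this sketch; the input checks are an added convenience.

```r
# Spearman's correction for attenuation:
# r = observed correlation, rel1 / rel2 = reliabilities of the two measures
disattenuate <- function(r, rel1, rel2) {
  stopifnot(rel1 > 0, rel1 <= 1, rel2 > 0, rel2 <= 1)
  r / sqrt(rel1 * rel2)
}

disattenuate(0.174, 0.81, 0.88)   # fairly reliable measures: 0.206094
disattenuate(0.174, 0.65, 0.59)   # less reliable measures:   0.2809743
disattenuate(0.174, 1, 1)         # perfectly reliable:       0.174
```

The lower the reliabilities, the larger the gap between the observed and the corrected correlation.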

so we usually use latent variables

we compute a “Confirmatory Factor Analysis” (CFA) containing the measurement models of the measures Y_1, Y_2, whereby the “latent” correlation is corrected for the unreliability of the measures

you typically get higher correlations (in absolute value), or in the context of regression models higher coefficients, which can reduce Type II errors, …

Example of CFA first order

simulation study

using the following R package: https://cran.r-project.org/web/packages/faux/

if you want to change the corr matrix, just ask ChatGPT: https://chatgpt.com/share/6718a0ff-05f4-8007-b424-4af35980fc49

set.seed(123)
# Load necessary libraries
library(faux)
Warning: package 'faux' was built under R version 4.3.3

************
Welcome to faux. For support and examples visit:
https://debruine.github.io/faux/
- Get and set global package options with: faux_options()
************
library(psych)
Warning: package 'psych' was built under R version 4.3.3
# Set up a correlation matrix
# We have two measurement models with three items each, 
# with correlations of 0.7 within each measurement model, 
# and correlations of 0.3 between the two models.

cor_matrix <- matrix(c(
  1, 0.7, 0.7, 0.3, 0.3, 0.3,  # Items A, B, C with D, E, F
  0.7, 1, 0.7, 0.3, 0.3, 0.3,  # Same structure
  0.7, 0.7, 1, 0.3, 0.3, 0.3,
  0.3, 0.3, 0.3, 1, 0.7, 0.7,
  0.3, 0.3, 0.3, 0.7, 1, 0.7,
  0.3, 0.3, 0.3, 0.7, 0.7, 1
), nrow=6, byrow=TRUE)

# Generate data with the specified correlation structure
data <- rnorm_multi(
  n = 300, 
  mu = rep(3, 6),  # Set mean of 3 for all items
  sd = rep(1, 6),  # Set standard deviation of 1 for all items
  r = cor_matrix   # Use the defined correlation matrix
)

# Assign column names to represent items
colnames(data) <- c("A", "B", "C", "D", "E", "F")

# Check the correlation matrix
# cor(data)

# Visualize the correlation matrix
psych::cor.plot(cor(data))

Latent variable approach:

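the estimated latent correlation, corrected for unreliability, is around .393 (the simulated correlation matrix implies a population value of 0.3 / 0.7 \approx .43)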

# Load necessary libraries
library(lavaan)
Warning: package 'lavaan' was built under R version 4.3.3
This is lavaan 0.6-17
lavaan is FREE software! Please report any bugs.

Attaching package: 'lavaan'
The following object is masked from 'package:psych':

    cor2cov
library(semPlot)
# fit model
myModel <- '
f1 =~ A + B + C
f2 =~ D + E + F
'
fit <- cfa(myModel, data=data)
summary(fit, standardized = TRUE)
lavaan 0.6.17 ended normally after 26 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        13

  Number of observations                           300

Model Test User Model:
                                                      
  Test statistic                                 7.458
  Degrees of freedom                                 8
  P-value (Chi-square)                           0.488

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  f1 =~                                                                 
    A                 1.000                               0.738    0.806
    B                 0.989    0.074   13.362    0.000    0.731    0.773
    C                 1.126    0.080   14.005    0.000    0.831    0.841
  f2 =~                                                                 
    D                 1.000                               0.784    0.831
    E                 1.121    0.070   16.105    0.000    0.879    0.865
    F                 1.032    0.067   15.402    0.000    0.809    0.814

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  f1 ~~                                                                 
    f2                0.227    0.043    5.304    0.000    0.393    0.393

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .A                 0.293    0.037    7.942    0.000    0.293    0.350
   .B                 0.359    0.040    8.889    0.000    0.359    0.402
   .C                 0.286    0.042    6.738    0.000    0.286    0.292
   .D                 0.276    0.034    8.082    0.000    0.276    0.309
   .E                 0.261    0.038    6.787    0.000    0.261    0.252
   .F                 0.333    0.039    8.632    0.000    0.333    0.337
    f1                0.545    0.070    7.785    0.000    1.000    1.000
    f2                0.615    0.074    8.340    0.000    1.000    1.000
semPlot::semPaths(object = fit, what = "std")
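The link between the standardized loadings and the standardized residual variances in the output above can be checked by hand: in a model with one factor per item and uncorrelated errors, the communality of an item is its squared standardized loading, and the uniqueness is one minus that. The values below are copied from the Std.all column; the variable names are chosen for illustration.

```r
# Standardized loadings (Std.all) copied from the lavaan summary above
std_loading <- c(A = 0.806, B = 0.773, C = 0.841,
                 D = 0.831, E = 0.865, F = 0.814)

communality <- std_loading^2    # share of item variance explained by the factor
uniqueness  <- 1 - communality  # remaining share (specific + error variance)

round(cbind(communality, uniqueness), 3)
```

The uniquenesses agree with the Std.all column of the Variances block up to rounding of the loadings to three decimals.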

Manifest variable approach (simple computing mean scores):

not corrected for the unreliability of the measurements \rightarrow the correlation is lower

Y1 <- rowMeans(data[,1:3])
Y2 <- rowMeans(data[,4:6])
cor(Y1, Y2)
[1] 0.336648

check reliability of measures:

# average inter-item correlation of first factor (off-diagonal elements only,
# the diagonal of 1s would inflate the mean):
r_f1 <- cor(data[, 1:3])
mean(r_f1[lower.tri(r_f1)])
[1] 0.6504204
# compute Cronbachs Alpha
re_factor1 <- psych::alpha(subset(data, select = c(A, B, C)))
Number of categories should be increased  in order to count frequencies. 
re_factor1$total
 raw_alpha std.alpha   G6(smc) average_r      S/N        ase     mean       sd
 0.8478307 0.8480644 0.7895914 0.6504204 5.581738 0.01516258 2.969903 0.833271
  median_r
 0.6569992
re_factor2 <- psych::alpha(subset(data, select = c(D, E, F)))
Number of categories should be increased  in order to count frequencies. 
re_factor2$total
 raw_alpha std.alpha   G6(smc) average_r      S/N        ase     mean        sd
 0.8745568 0.8750114 0.8240772 0.7000219 7.000729 0.01250763 2.979012 0.8823433
  median_r
 0.7000894
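To close the loop, the manifest correlation can be corrected with Spearman's formula, using Cronbach's Alpha as the reliability estimate of each mean score. The numbers are copied from the output above; treating alpha as the reliability is an assumption of this sketch, since it presumes essentially tau-equivalent items.

```r
r_manifest <- 0.336648     # cor(Y1, Y2) of the mean scores
alpha_f1   <- 0.8478307    # raw_alpha for items A, B, C
alpha_f2   <- 0.8745568    # raw_alpha for items D, E, F

# Correction for attenuation with the alphas as reliabilities
r_corrected <- r_manifest / sqrt(alpha_f1 * alpha_f2)
round(r_corrected, 3)      # about 0.391
```

The corrected manifest correlation is close to the latent correlation of .393 estimated by the CFA.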

Random stuff

cite literature in Quarto (rmarkdown)

  1. Blah blah (see Yarkoni and Westfall 2017, 33–35; also Speck et al. 2017, ch. 1).
  2. Blah blah (Yarkoni and Westfall 2017, 33–35).
  3. Blah blah (Yarkoni and Westfall 2017; Speck et al. 2017).
  4. Rutkowski et al. says blah (2017).
  5. Yarkoni and Westfall (2017) says blah.

References

Biemer, Paul P., Edith D. de Leeuw, Stephanie Eckman, Brad Edwards, Frauke Kreuter, Lars E. Lyberg, N. Clyde Tucker, and Brady T. West. 2017. Total Survey Error in Practice. John Wiley & Sons.
Bollen, Kenneth A. 2002. “Latent Variables in Psychology and the Social Sciences.” Annual Review of Psychology 53 (January): 605–34. https://doi.org/10.1146/annurev.psych.53.100901.135239.
Borsboom, Denny. 2008. “Latent Variable Theory.” Measurement: Interdisciplinary Research and Perspectives 6 (1-2): 25–53. https://doi.org/10.1080/15366360802035497.
Cronbach, Lee J., and Paul E. Meehl. 1955. “Construct Validity in Psychological Tests.” Psychological Bulletin 52 (4): 281–302. https://doi.org/10.1037/h0040957.
Groves, Robert M., and Lars Lyberg. 2010. “Total Survey Error: Past, Present, and Future.” Public Opinion Quarterly 74 (5): 849–79. https://doi.org/10.1093/poq/nfq065.
Moosbrugger, Helfried, and Augustin Kelava, eds. 2020. Testtheorie und Fragebogenkonstruktion. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-662-61532-4.
Peng, Roger D., and Elizabeth Matsui. 2016. The Art of Data Science: A Guide for Anyone Who Works with Data. Lulu.com.
Schurig, Michael. 2017. “Latente Variablenmodelle in der empirischen Bildungsforschung - die Schärfe und Struktur der Schatten an der Wand.” PhD thesis, TU Dortmund.
Skrondal, Anders, and Sophia Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. New York: Chapman and Hall/CRC. https://doi.org/10.1201/9780203489437.
———. 2007. “Latent Variable Modelling: A Survey.” Scandinavian Journal of Statistics 34 (4): 712–45. https://doi.org/10.1111/j.1467-9469.2007.00573.x.
Speck, Olga, David Speck, Rafael Horn, Johannes Gantner, and Klaus Peter Sedlbauer. 2017. “Biomimetic Bio-Inspired Biomorph Sustainable? An Attempt to Classify and Clarify Biology-Derived Technical Developments.” Bioinspiration & Biomimetics 12 (1): 011004. https://doi.org/10.1088/1748-3190/12/1/011004.
Tourangeau, Roger, Lance J. Rips, and Kenneth Rasinski. 2000. The Psychology of Survey Response. Cambridge University Press.
Yarkoni, Tal, and Jacob Westfall. 2017. “Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning.” Perspectives on Psychological Science 12 (6): 1100–1122. https://doi.org/10.1177/1745691617693393.