Setup

Packages

library(pacman)
p_load(lavaan, dplyr, knitr)

Data Input

Names = list("I",   "S",    "A",    "V",    "C",    "DS",   "PC",   "PA",   "BD",   "OA",   "CO",   "HM",   "GC",   "NR",   "T",    "WO",   "MA",   "MS",   "PS",   "FP",   "AR",   "R",    "RD",   "RU")

lowerw = '
1                                                                                           
0.6 1                                                                                       
0.37    0.37    1                                                                                   
0.71    0.69    0.38    1                                                                               
0.59    0.51    0.25    0.63    1                                                                           
0.2 0.23    0.52    0.11    0.1 1                                                                       
0.29    0.38    0.09    0.28    0.34    0.14    1                                                                   
0.24    0.17    0.23    0.28    0.11    0.14    0.18    1                                                               
0.38    0.41    0.34    0.56    0.23    0.25    0.43    0.4 1                                                           
0.29    0.37    0.24    0.45    0.24    0.23    0.44    0.29    0.55    1                                                       
0.03    0.15    0.1 0.05    0.05    0.14    0.15    0.22    0.16    0.13    1                                                   
0.19    0.36    0.45    0.26    0.31    0.4 0.16    0.05    0.21    0.25    0.13    1                                               
0.39    0.31    0   0.28    0.3 0.11    0.37    0.13    0.36    0.26    0.14    0   1                                           
0.12    0.12    0.29    0.13    0.11    0.63    0.13    0.14    0.3 0.18    0   0.35    -0.08   1                                       
0.31    0.29    0.44    0.37    0.21    0.26    0.4 0.37    0.675   0.53    0.27    0.31    0.07    0.25    1                                   
0.11    0.18    0.39    0.12    0.17    0.64    0.19    0.04    0.29    0.21    0.22    0.36    -0.06   0.59    0.31    1                               
0.23    0.36    0.39    0.41    0.38    0.26    0.36    0.32    0.48    0.4 0.24    0.39    0.17    0.21    0.44    0.25    1                           
0.02    0.14    0.09    0.07    0.09    0.14    0.29    0.2 0.28    0.32    0.45    0.3 0.17    0.09    0.39    0.25    0.29    1                       
0.31    0.4 0.32    0.3 0.31    0.09    0.17    0.1 0.29    0.32    0.21    0.33    0.32    -0.02   0.31    0.1 0.14    0.29    1                   
0.72    0.68    0.2 0.75    0.54    0.09    0.34    0.2 0.4 0.35    -0.09   0.24    0.35    0.08    0.24    0.07    0.26    -0.03   0.32    1               
0.49    0.44    0.57    0.54    0.49    0.43    0.15    0.24    0.44    0.24    0.11    0.46    0.13    0.34    0.35    0.35    0.42    0.29    0.33    0.41    1           
0.66    0.64    0.33    0.73    0.49    0.15    0.38    0.24    0.46    0.45    0   0.34    0.39    0.1 0.35    0.14    0.33    0.12    0.42    0.7 0.48    1       
0.53    0.5 0.41    0.63    0.48    0.41    0.35    0.14    0.42    0.3 -0.04   0.4 0.12    0.35    0.3 0.29    0.51    0.1 0.17    0.53    0.65    0.55    1   
0.55    0.51    0.39    0.59    0.46    0.27    0.23    0.18    0.33    0.29    0.1 0.34    0.09    0.16    0.36    0.25    0.37    0.14    0.3 0.47    0.53    0.6 0.68    1'

lowerb = '
1                                                                                           
0.71    1                                                                                       
0.48    0.46    1                                                                                   
0.7 0.65    0.44    1                                                                               
0.68    0.67    0.53    0.76    1                                                                           
0.11    0.19    0.15    0.25    0.23    1                                                                       
0.39    0.43    0.23    0.26    0.39    0.1 1                                                                   
0.36    0.41    0.3 0.33    0.39    0.17    0.39    1                                                               
0.45    0.56    0.44    0.43    0.49    0.22    0.58    0.44    1                                                           
0.2 0.25    0.06    0.16    0.19    0.12    0.44    0.37    0.51    1                                                       
0.11    0.04    0.21    0.13    0.18    0.24    0.09    0.21    0.25    0.27    1                                                   
0.15    0.18    0.27    0.11    0.21    0.38    0.16    0.23    0.29    0.15    0.26    1                                               
0.25    0.33    0.18    0.26    0.27    0.18    0.41    0.36    0.42    0.35    0.1 0.13    1                                           
0.16    0.17    0.19    0.27    0.23    0.52    0.14    0.06    0.23    0.01    0.08    0.19    0.06    1                                       
0.27    0.4 0.27    0.31    0.33    0.19    0.51    0.39    0.66    0.53    0.12    0.19    0.5 0.22    1                                   
0.24    0.2 0.25    0.1 0.28    0.29    0.25    0.23    0.18    0.08    0.12    0.26    0.08    0.4 0.1 1                               
0.44    0.48    0.35    0.43    0.44    0.33    0.45    0.26    0.59    0.38    0.22    0.26    0.34    0.24    0.41    0.32    1                           
0.26    0.26    0.18    0.22    0.28    0.31    0.33    0.35    0.44    0.48    0.3 0.22    0.27    0.18    0.48    0.3 0.41    1                       
0.19    0.18    0.2 0.14    0.26    0.12    0.38    0.44    0.45    0.35    0.39    0.28    0.21    0.09    0.28    0.11    0.35    0.38    1                   
0.66    0.53    0.25    0.69    0.55    0.18    0.35    0.34    0.38    0.27    0.09    0.13    0.29    0.09    0.26    0.09    0.35    0.24    0.12    1               
0.44    0.45    0.59    0.47    0.57    0.14    0.27    0.26    0.51    0.06    0.07    0.28    0.11    0.18    0.3 0.23    0.41    0.12    0.21    0.29    1           
0.73    0.64    0.43    0.69    0.67    0.24    0.4 0.58    0.55    0.28    0.17    0.2 0.37    0.26    0.39    0.18    0.47    0.3 0.31    0.69    0.48    1       
0.55    0.43    0.35    0.6 0.48    0.3 0.17    0.1 0.34    0.11    0.12    0.16    0.22    0.23    0.12    0.12    0.52    0.24    0.1 0.59    0.39    0.52    1   
0.59    0.56    0.41    0.67    0.61    0.15    0.27    0.27    0.39    0.21    0.13    0.25    0.33    0.24    0.23    0.13    0.46    0.21    0.19    0.67    0.45    0.63    0.73    1'

NJW.cor = getCov(lowerw, names = Names)
NJB.cor = getCov(lowerb, names = Names)
NJWSDs <- c(2.11, 2.91, 2.36, 2.4, 2.82, 2.9, 2.53, 2.12, 2.71, 3.02, 2.77, 2.34, 2.72, 2.52, 2.69, 2.12, 2.76, 2.51, 2.04, 12.55, 12.17, 11.08, 12.69, 9.02)
NJBSDs <- c(2.47, 2.66, 2.4, 2.49, 2.41, 2.72, 2.38, 2.22, 3.16, 2.98, 2.39, 2.24, 2.95, 2.47, 2.32, 1.67, 2.32, 2.43, 2.21, 12.44, 9.4, 11.42, 11.05, 8.45)
NJW.cov = lavaan::cor2cov(R = NJW.cor, sds = NJWSDs)
NJB.cov = lavaan::cor2cov(R = NJB.cor, sds = NJBSDs)

Wmeans = c(9.83, 10.73, 9.53, 10.35, 10.29, 8.93, 10.09, 11.38, 10.13, 9.98, 10.37, 9.09, 10.07, 9.84, 10.56, 9.19, 9.53, 9.53, 10.07, 99.63, 99.27, 99.8, 100.8, 96.66)
Bmeans = c(8.58, 8.87, 8.51, 9.05, 9.05, 8.52, 9.08, 10.55, 7.78, 8, 10.2, 8.1, 9.4, 9.79, 9.03, 8.87, 9.03, 8.42, 9.55, 96.06, 90.53, 91.95, 95.5, 92.74)

NJCovs <- list(NJW.cov, NJB.cov)
NJMeans <- list(Wmeans, Bmeans)
NJNs <- list(86, 86)

FITM <- c("chisq", "df", "nPar", "cfi", "rmsea", "rmsea.ci.lower", "rmsea.ci.upper", "aic", "bic")
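
For readers unfamiliar with lavaan::cor2cov, the correlation-to-covariance step above is just a rescaling of the correlation matrix by the standard deviations: the covariance matrix equals D R D, where D is the diagonal matrix of SDs. A minimal base-R sketch on a made-up 3 x 3 matrix follows; the values and the names R_toy, sds_toy, and S_toy are illustrative and not taken from the data.

```r
# Toy correlation matrix and standard deviations (illustrative values only).
vars <- c("x1", "x2", "x3")
R_toy <- matrix(c(1.0, 0.6, 0.3,
                  0.6, 1.0, 0.4,
                  0.3, 0.4, 1.0),
                nrow = 3, byrow = TRUE, dimnames = list(vars, vars))
sds_toy <- c(2, 3, 10)

# Covariance = D %*% R %*% D, where D is the diagonal matrix of SDs.
D <- diag(sds_toy)
S_toy <- D %*% R_toy %*% D
dimnames(S_toy) <- dimnames(R_toy)

# Round trip: converting back recovers the correlations, and the
# diagonal of the covariance matrix holds the variances (squared SDs).
all.equal(cov2cor(S_toy), R_toy)
diag(S_toy)
```

Because the last five subtests have SDs near 10 while the rest are near 2.5, the covariance matrices are on very different scales across indicators; the standardized solution is what makes the later estimates comparable.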

Rationale

People often claim that a generally lower-scoring group X is “catching up” to a generally higher-scoring group Y on a measure of some psychological construct. These claims usually rest on observed rather than latent scores. I contend that this leads to improper inferences about the state of gaps, because observed scores are not entirely derived from the constructs that actually interest researchers. For example, suppose two groups differ in mean neuroticism and this explains a disparity in the depression symptoms they exhibit. It would do no good to teach the more neurotic group which answers to bubble in to obtain observed scores that would normally indicate lower neuroticism. Doing so changes only the observed score and leaves the trait unaltered, so the groups’ responses are no longer psychometrically comparable (barring an effect of the observed score itself on the trait). This is similar to equating groups in height by asking the taller group to sit down: it does not address real differences, only ones that exist in that situation.
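
The point can be made concrete with a toy simulation. This is a sketch under assumed values (a one-SD latent gap, a loading of 0.8, and a 0.5-point “coaching” bonus added to one group’s observed score), not an analysis of any real data.

```r
set.seed(1)
n <- 1e5

# Latent trait: group Y is one SD above group X (assumed gap).
trait_x <- rnorm(n, mean = 0)
trait_y <- rnorm(n, mean = 1)

# Same measurement model in both groups: observed = 0.8 * trait + noise.
observe <- function(trait) 0.8 * trait + rnorm(length(trait), sd = 0.6)
obs_x <- observe(trait_x)
obs_y <- observe(trait_y)

# "Coaching" shifts group X's observed scores without touching the trait.
obs_x_coached <- obs_x + 0.5

gap_latent   <- mean(trait_y) - mean(trait_x)
gap_observed <- mean(obs_y) - mean(obs_x)
gap_coached  <- mean(obs_y) - mean(obs_x_coached)

round(c(latent = gap_latent, observed = gap_observed, coached = gap_coached), 2)
```

The observed gap shrinks by exactly the bonus while the latent gap does not move. In measurement terms, the coaching introduces an intercept difference between the groups, which is a violation of invariance rather than a change in the trait.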

To make this clear for psychological constructs, I show below that the group differences in mean levels of general intelligence (g) measured by two different assessments, which show different observed score gaps, are virtually identical. The dataset comes from Naglieri & Jensen (1987) and was previously found to yield strict factorial invariance by Dolan & Hamaker (2001; see also https://rpubs.com/JLLJ/SH where I’ve output a similar result). The samples are composed of fourth- and fifth-grade students, equally sized, and matched on age, sex, school, and socioeconomic status. Matching reduces the differences from their typical level (about 1 Hedges’ g) by a modest amount; as reported by Jensen (1998), it is expected to reduce them by \(\frac{1}{3}\) (see also Kane & Oakland, 2010, p. 328). Notably, it would be irresponsible to treat this reduction as evidence of a causal effect of the matching variables, since they reflect both independent effects and omitted sources of variance. In fact, the gap tends to increase with the level of, e.g., socioeconomic status, to remain unchanged with age, and to be consistent by sex, barring a focus on strongly sex-differentiated group factors like spatial and mechanical ability/skill.

Analysis

Do They Measure the Same g?

Before showing that the groups have the same gap in a latent construct measured by different tests, it should be confirmed that the tests measure the same construct. I’ve performed EFAs of the WISC-R and the K-ABC separately elsewhere (to be included in the supplement of a paper forthcoming as of May 13, 2020), and the following confirmatory models are based on them. Because the batteries are measurement invariant across the groups, I use only the data from the White group for this part of the analysis.

#WISC-R Model

HOFWHISKER.model <- '
VIQ =~ I + S + V + C
PIQ =~ PC + PA + BD + OA + CO
FD =~ A + DS

gWI =~ VIQ + PIQ + FD'

BFWHISKER.model <- '
VIQ =~ I + S + V + C
PIQ =~ PC + PA + BD + OA + CO
A ~~ DS

gWI =~ I + S + V + C + PC + PA + BD + OA + CO + A + DS
'

HOFWHISKER.fit <- cfa(HOFWHISKER.model, sample.cov = NJW.cov, sample.nobs = 86, std.lv = TRUE, orthogonal = TRUE)
BFWHISKER.fit <- cfa(BFWHISKER.model, sample.cov = NJW.cov, sample.nobs = 86, std.lv = TRUE, orthogonal = TRUE)

round(cbind(HOF = fitMeasures(HOFWHISKER.fit, FITM),
            BF = fitMeasures(BFWHISKER.fit, FITM)),3)

##                     HOF       BF
## chisq            45.557   37.018
## df               41.000   34.000
## npar             25.000   32.000
## cfi               0.985    0.990
## rmsea             0.036    0.032
## rmsea.ci.lower    0.000    0.000
## rmsea.ci.upper    0.085    0.087
## aic            4203.526 4208.986
## bic            4264.885 4287.526

#K-ABC Model

HOFKABC.model <- '
Gfle =~ MA + AR + RD + RU
Gv =~ GC + PS + FP + R
Gsm =~ 1*NR + 1*WO 
Gmot =~ HM + T + MA + MS 

gKA =~ Gfle + Gv + Gsm + Gmot' #NR + WO set homogeneous to identify; doesn't meaningfully affect fit, which is similar to the model fitted by Dolan & Hamaker anyway.

BFKABC.model <- '
Gfle =~ MA + AR + RD + RU
Gv =~ GC + PS + FP + R
NR ~~ WO
Gmot =~ HM + T + MA + MS 

gKA =~ MA + AR + RD + RU + GC + PS + FP + R + NR + WO + HM + T + MS
RD ~~ 0*RD' #required because of RD/RU collinearity - not an issue

HOFKABC.fit <- cfa(HOFKABC.model, sample.cov = NJW.cov, sample.nobs = 86, std.lv = TRUE, orthogonal = TRUE)
BFKABC.fit <- cfa(BFKABC.model, sample.cov = NJW.cov, sample.nobs = 86, std.lv = TRUE, orthogonal = TRUE)

round(cbind(HOF = fitMeasures(HOFKABC.fit, FITM),
            BF = fitMeasures(BFKABC.fit, FITM)),3)

##                     HOF       BF
## chisq            97.334   69.020
## df               62.000   53.000
## npar             29.000   38.000
## cfi               0.907    0.958
## rmsea             0.081    0.059
## rmsea.ci.lower    0.048    0.000
## rmsea.ci.upper    0.111    0.096
## aic            6182.016 6171.701
## bic            6253.192 6264.966

In both cases, the higher-order model is the better description of the data based on BIC. The argument for using a higher-order model to compare factors rests principally on the fact that it allows stronger control of the sampling error that results from the selection of tests in the battery being modeled. For that reason, I’ll assess the relationship between higher-order g factors. The relationship is virtually identical for the bifactor model, which fits better in terms of noncentrality-based indices (CFI, RMSEA) and absolute fit but is obviously less parsimonious and atheoretical. If anyone wants to mix and match, that’s up to them to both do and justify (the first is easy; the second varies in difficulty).

The steps in comparing these factors are: first, a model with no relationship between the g factors; second, a model in which that relationship is freely estimated; third, a model in which relevant group factors are allowed to covary; fourth, a model in which large indicator-level residual covariances are allowed; and finally, a model in which the g factors are constrained to be equal. The covariances are added only to assess the “true” extent of the relationship between the tests: whether it is carried by g alone, whether it is inflated by related group factors and indicators, and so on; others performing the same analysis have explained this. To save space, the first few models are not illustrated, but their fits are included, and the modifications in the final model appear in the parameter table. The combined model is simply the two higher-order configurations above fitted together.
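
Since the syntax for these steps is not printed, here is a hedged sketch of how the first two steps and the equality constraint might be written. The measurement parts are copied from the higher-order models above; the object names (wisc_part, kabc_part, norel_model, and so on) are mine, and the exact syntax behind HOFNO.fit, HOFSOME.fit, and HOFSAME.fit is an assumption. In particular, the SAME model reported here was built on top of the group-factor and residual covariances, which this sketch omits.

```r
# Measurement parts, copied from the higher-order models above.
wisc_part <- '
VIQ =~ I + S + V + C
PIQ =~ PC + PA + BD + OA + CO
FD =~ A + DS
gWI =~ VIQ + PIQ + FD'

kabc_part <- '
Gfle =~ MA + AR + RD + RU
Gv =~ GC + PS + FP + R
Gsm =~ 1*NR + 1*WO
Gmot =~ HM + T + MA + MS
gKA =~ Gfle + Gv + Gsm + Gmot'

# One line distinguishes the models: the covariance between the g factors.
# With std.lv = TRUE both g variances are 1, so it is a correlation.
combine <- function(g_link) paste(wisc_part, kabc_part, g_link, sep = "\n")
norel_model   <- combine("gWI ~~ 0*gKA")  # no relationship
freerel_model <- combine("gWI ~~ gKA")    # freely estimated
same_model    <- combine("gWI ~~ 1*gKA")  # constrained to unity

# Fit only if lavaan and the covariance matrix from the setup are available.
if (requireNamespace("lavaan", quietly = TRUE) && exists("NJW.cov")) {
  models <- list(NOREL = norel_model, FREEREL = freerel_model, SAME = same_model)
  fits <- lapply(models, function(m) {
    lavaan::cfa(m, sample.cov = NJW.cov, sample.nobs = 86,
                std.lv = TRUE, orthogonal = TRUE)
  })
  print(round(sapply(fits, lavaan::fitMeasures, c("chisq", "df", "npar")), 3))
}
```

Writing the link as a single pasted line keeps the three models nested by construction, so their fits differ only through the g covariance. Fixing a correlation at 1 can trigger estimation warnings, which is expected for a boundary constraint of this kind.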

#How related are the g factors?

round(cbind(NOREL = fitMeasures(HOFNO.fit, FITM),
            FREEREL = fitMeasures(HOFSOME.fit, FITM),
            GCOVS = fitMeasures(HOFGCOV.fit, FITM),
            RECOVS = fitMeasures(HOFRECOV.fit, FITM),
            SAME = fitMeasures(HOFSAME.fit, FITM)),3)

##                    NOREL   FREEREL     GCOVS    RECOVS      SAME
## chisq            600.813   477.948   388.882   318.493   319.152
## df               246.000   245.000   244.000   232.000   233.000
## npar              54.000    55.000    56.000    68.000    67.000
## cfi                0.647     0.768     0.856     0.914     0.914
## rmsea              0.130     0.105     0.083     0.066     0.066
## rmsea.ci.lower     0.116     0.091     0.067     0.047     0.046
## rmsea.ci.upper     0.143     0.119     0.098     0.083     0.083
## aic            10385.542 10264.677 10177.611 10131.222 10129.881
## bic            10518.076 10399.667 10315.054 10298.118 10294.323

parameterEstimates(HOFRECOV.fit, standardized = TRUE) %>%
  filter(op == "~~") %>% 
  select(Variable = lhs, Target = rhs, SE = se, Z = z, 'p-value' = pvalue, "Standardized Variance/Covariance" = std.all) %>% 
  kable(digits = 3, format = "pandoc")

Variable	Target	SE	Z	p-value	Standardized Variance/Covariance
gWI	gKA	0.020	49.966	0.000	1.018
VIQ	Gv	0.000	NA	NA	1.000
FD	Gsm	0.156	6.419	0.000	0.999
PIQ	Gmot	0.000	NA	NA	1.000
A	AR	1.791	3.137	0.002	0.345
CO	RD	2.067	-2.427	0.015	-0.330
CO	MS	0.634	2.944	0.003	0.323
PC	GC	0.575	2.600	0.009	0.284
DS	GC	0.521	2.935	0.003	0.535
V	PS	0.244	-2.402	0.016	-0.289
A	PS	0.374	2.959	0.003	0.303
BD	HM	0.439	-2.532	0.011	-0.334
I	MA	0.301	-2.221	0.026	-0.254
C	MA	0.489	2.022	0.043	0.227
A	MA	0.396	2.170	0.030	0.206
A	MS	0.419	-2.520	0.012	-0.243
I	I	0.266	5.872	0.000	0.362
S	S	0.558	5.980	0.000	0.399
V	V	0.246	4.908	0.000	0.212
C	C	0.688	6.226	0.000	0.534
PC	PC	0.729	6.282	0.000	0.728
PA	PA	0.568	6.394	0.000	0.817
BD	BD	0.533	4.715	0.000	0.340
OA	OA	0.863	6.003	0.000	0.575
CO	CO	1.054	6.554	0.000	0.943
A	A	0.613	6.372	0.000	0.743
DS	DS	1.022	1.320	0.187	0.165
MA	MA	0.712	6.259	0.000	0.609
AR	AR	11.804	5.762	0.000	0.501
RD	RD	9.759	3.430	0.001	0.210
RU	RU	6.373	5.573	0.000	0.442
GC	GC	0.934	6.491	0.000	0.790
PS	PS	0.526	6.467	0.000	0.856
FP	FP	8.383	5.170	0.000	0.280
R	R	7.264	5.513	0.000	0.331
NR	NR	0.591	5.949	0.000	0.663
WO	WO	0.424	5.402	0.000	0.562
HM	HM	0.710	6.220	0.000	0.817
T	T	0.568	4.173	0.000	0.331
MS	MS	0.770	6.287	0.000	0.753
VIQ	VIQ	0.000	NA	NA	0.361
PIQ	PIQ	0.000	NA	NA	0.462
FD	FD	0.000	NA	NA	0.675
gWI	gWI	0.000	NA	NA	1.000
Gfle	Gfle	0.000	NA	NA	0.119
Gv	Gv	0.000	NA	NA	0.483
Gsm	Gsm	0.000	NA	NA	0.560
Gmot	Gmot	0.000	NA	NA	0.641
gKA	gKA	0.000	NA	NA	1.000

The model fits similarly to the Dolan & Hamaker model (its small size can explain its bad initial fit relative to other tests). Of course, a bifactor model in which the batteries are modeled as one, with a confirmatory structure derived from the same EFA, fits much better and shows congruent g loadings, and thus gives more evidence that the tests measure the same g factor. Regardless, the estimated correlation between the g factors is 1.018, and constraining it to unity costs only 0.659 χ² units for 1 df (318.493 vs. 319.152) while lowering both AIC and BIC. The case that the g factors measured by the WISC-R and K-ABC are identical/interchangeable is therefore tenable, and it is reasonable to expect them to yield approximately the same-sized gap (approximately, because of minor differences which could be accounted for in a combined model of the batteries).

How Large is the Gap? On Which Assessments?

As already mentioned, MGCFA of this battery has already been conducted. Modifications to improve initial fit are possible and, within reason, won’t alter the substantive result (unless using the bifactor model, which is obviously contaminated by the subtest sampling variance that makes the observed total scores differ in the first place). For this analysis, all I’ll assess is whether the g factors from the simplest between-group models of the separate WISC-R and K-ABC batteries produce the same mean difference despite dissimilar observed differences. Below are the intercepts and latent means from models in which loadings, intercepts, residuals, and latent variances have been constrained to equality for the separate WISC-R and K-ABC higher-order models. In each table, the first block of rows is the reference (White) group, whose latent means are fixed at zero, and the second block is the Black group.
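
For orientation, the mean structure implied by a higher-order model under these constraints can be written as follows. The notation is the conventional one and is my assumption, not something read off the fitted objects: \(\tau\) are the indicator intercepts, \(\Lambda\) the first-order loadings, \(\Gamma\) the loadings of the first-order factors on g, \(\alpha_k\) the first-order factor intercepts in group \(k\), and \(\kappa_k\) the mean of g in group \(k\), with group 1 as the reference group.

\[
\mu_k = \tau + \Lambda\,(\alpha_k + \Gamma\,\kappa_k), \qquad k = 1, 2
\]

\[
\mu_2 - \mu_1 = \Lambda\,\bigl[(\alpha_2 - \alpha_1) + \Gamma\,(\kappa_2 - \kappa_1)\bigr], \qquad \alpha_1 = 0,\ \kappa_1 = 0
\]

The observed gap on each subtest therefore mixes the g difference with group-factor differences, weighted by that subtest’s loadings. Composites built from different subtests can then show different observed gaps for the same difference in \(\kappa\).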

parameterEstimates(WHISKERDIFF.fit, standardized = TRUE) %>%
  filter(op == "~1") %>% 
  select(Indicator = lhs, "Unstandardized Intercept" = est, SE = se, Z = z, 'p-value' = pvalue, "Standardized Intercept" = std.all) %>% 
  kable(digits = 3, format = "pandoc")

Indicator	Unstandardized Intercept	SE	Z	p-value	Standardized Intercept
I	9.839	0.229	43.015	0.000	4.369
S	10.565	0.280	37.670	0.000	3.790
V	10.408	0.247	42.152	0.000	4.372
C	10.345	0.254	40.677	0.000	4.033
PC	10.218	0.220	46.402	0.000	4.355
PA	11.447	0.191	59.822	0.000	5.447
BD	10.096	0.292	34.587	0.000	3.657
OA	9.852	0.280	35.172	0.000	3.369
CO	10.539	0.217	48.670	0.000	4.131
A	9.502	0.264	35.942	0.000	3.831
DS	9.045	0.256	35.303	0.000	3.201
VIQ	0.000	0.000	NA	NA	0.000
PIQ	0.000	0.000	NA	NA	0.000
FD	0.000	0.000	NA	NA	0.000
gWI	0.000	0.000	NA	NA	0.000
I	9.839	0.229	43.015	0.000	4.259
S	10.565	0.280	37.670	0.000	3.699
V	10.408	0.247	42.152	0.000	4.250
C	10.345	0.254	40.677	0.000	3.944
PC	10.218	0.220	46.402	0.000	4.123
PA	11.447	0.191	59.822	0.000	5.233
BD	10.096	0.292	34.587	0.000	3.243
OA	9.852	0.280	35.172	0.000	3.158
CO	10.539	0.217	48.670	0.000	4.099
A	9.502	0.264	35.942	0.000	4.191
DS	9.045	0.256	35.303	0.000	3.295
VIQ	-0.125	0.159	-0.787	0.431	-0.069
PIQ	-0.629	0.198	-3.172	0.002	-0.347
FD	0.006	0.196	0.032	0.975	0.005
gWI	-0.759	0.165	-4.590	0.000	-0.680

parameterEstimates(KABCDIFF.fit, standardized = TRUE) %>%
  filter(op == "~1") %>% 
  select(Indicator = lhs, "Unstandardized Intercept" = est, SE = se, Z = z, 'p-value' = pvalue, "Standardized Intercept" = std.all) %>% 
  kable(digits = 3, format = "pandoc")

Indicator	Unstandardized Intercept	SE	Z	p-value	Standardized Intercept
MA	9.877	0.243	40.667	0.000	3.904
AR	97.336	1.135	85.743	0.000	8.336
RD	101.174	1.259	80.381	0.000	8.330
RU	96.890	0.920	105.359	0.000	10.857
GC	10.128	0.247	40.952	0.000	3.561
PS	10.090	0.184	54.789	0.000	4.731
FP	100.909	1.249	80.819	0.000	8.007
R	99.481	1.279	77.758	0.000	8.338
NR	9.932	0.220	45.235	0.000	4.241
WO	9.147	0.190	48.116	0.000	4.954
HM	9.016	0.217	41.529	0.000	3.872
T	10.458	0.267	39.127	0.000	4.010
MS	9.500	0.243	39.098	0.000	3.777
Gfle	0.000	0.000	NA	NA	0.000
Gv	0.000	0.000	NA	NA	0.000
Gsm	0.000	0.000	NA	NA	0.000
Gmot	0.000	0.000	NA	NA	0.000
gKA	0.000	0.000	NA	NA	0.000
MA	9.877	0.243	40.667	0.000	4.079
AR	97.336	1.135	85.743	0.000	8.779
RD	101.174	1.259	80.381	0.000	8.982
RU	96.890	0.920	105.359	0.000	11.681
GC	10.128	0.247	40.952	0.000	3.631
PS	10.090	0.184	54.789	0.000	4.815
FP	100.909	1.249	80.819	0.000	8.528
R	99.481	1.279	77.758	0.000	9.224
NR	9.932	0.220	45.235	0.000	4.265
WO	9.147	0.190	48.116	0.000	5.000
HM	9.016	0.217	41.529	0.000	4.002
T	10.458	0.267	39.127	0.000	4.291
MS	9.500	0.243	39.098	0.000	3.949
Gfle	-0.128	0.139	-0.925	0.355	-0.058
Gv	-0.289	0.168	-1.716	0.086	-0.195
Gsm	0.202	0.235	0.862	0.389	0.166
Gmot	-0.425	0.205	-2.077	0.038	-0.378
gKA	-0.594	0.153	-3.890	0.000	-0.650

As printed above, the higher-order g differences on the WISC-R and the K-ABC amounted to 0.68 and 0.65 SDs, respectively, for this sample. As Naglieri & Jensen reported, the FSIQ differences (based on raw scores) for the same batteries were 0.73 and 0.56 (the WISC-R difference was 0.77 based on standardized scores; there was no standardized score basis for the K-ABC, but it presumably wouldn’t change much if there were). Consistent with expectations, the differences were around \(\frac{1}{3}\) smaller than normal (though note that Jensen, 1998, based part of this prediction, for other data, on these data). This result adds evidence to the claim, which bears constant repeating, that group differences in observed scores should not be interpreted without an analysis of their psychometric properties. If this cannot be done for some reason, then groups shouldn’t be compared without marking the limitation. Finally, there are instances where observed differences on different assessments derive from bias, such as in comparisons of men and women on an automotive knowledge section of a test otherwise intended to measure spatial abilities. There are also situations where bias reduces group differences, such as in the analysis by Cockcroft et al. (2015; see https://rpubs.com/JLLJ/frontierssaukmgcfa). It is always advisable to attempt to understand and account for these scenarios with models rather than speculating based on observed scores.
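
The comparison in this paragraph comes down to a little arithmetic on the four standardized gaps quoted in the text. The sketch below only restates those numbers; the variable names are mine.

```r
# Standardized gaps quoted in the text (WISC-R, K-ABC).
latent_gap   <- c(WISC_R = 0.68, K_ABC = 0.65)  # higher-order g
observed_gap <- c(WISC_R = 0.73, K_ABC = 0.56)  # raw-score FSIQ composites

# How much do the two batteries disagree, on each basis?
latent_spread   <- unname(abs(diff(latent_gap)))
observed_spread <- unname(abs(diff(observed_gap)))

# Expected gap if matching removes a third of a typical gap of about 1.
expected_gap <- 1 * (1 - 1/3)

round(c(latent_spread = latent_spread,
        observed_spread = observed_spread,
        expected_gap = expected_gap), 2)
```

The batteries disagree by 0.17 on the observed composites but by only 0.03 on latent g, and both latent estimates sit close to the expected value of about 0.67.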

Discussion

Stop interpreting observed gaps without an analysis of latent ones or some theory to explain why the observed gaps are meaningful. Similarly, avoid exploratory methods for understanding latent gaps: it is trivial to demonstrate that if, say, a group lagging in a general factor gains on a multidimensional test because of group factors, exploratory methods can show an apparent gain in the general factor which is illusory. Psychometricians discussing group differences have a duty to account for these sorts of things using psychometrically appropriate methods like multi-group confirmatory factor analysis. This also stands as proof that even when a construct “explains” differences on one test, it need not explain the differences on another test that measures it identically in the same proportions. Hence, for a culture-free interpretation of hypotheses like Spearman’s via specific test modalities like elementary cognitive tasks, the hypothesis must be tested on those specific tests, not inferred because they are g saturated and another g-saturated test confirmed it.

References

Naglieri, J. A., & Jensen, A. R. (1987). Comparison of black-white differences on the WISC-R and the K-ABC: Spearman’s hypothesis. Intelligence, 11(1), 21-43. https://doi.org/10.1016/0160-2896(87)90024-9

Dolan, C. V., & Hamaker, E. L. (2001). Investigating Black-White differences in psychometric IQ: Multi-group confirmatory factor analyses of the WISC-R and K-ABC and a critique of the method of correlated vectors. In Advances in psychology research, Vol. 6. (pp. 31-59). Nova Science Publishers.

Jensen, A. R. (1998). The g Factor: The Science of Mental Ability. Praeger Publishers/Greenwood Publishing Group.

Kane, H. D., & Oakland, T. D. (2010). Group Differences in Cognitive Ability: A CHC Theory Framework. Mankind Quarterly, 50(4). http://mankindquarterly.org/archive/issue/50-4/4

Cockcroft, K., Alloway, T., Copello, E., & Milligan, R. (2015). A cross-cultural comparison between South African and British students on the Wechsler Adult Intelligence Scales Third Edition (WAIS-III). Frontiers in Psychology, 6. https://doi.org/10.3389/fpsyg.2015.00297

Note

Some of the advice offered here does not apply to other lines of research. These include many types of research into adverse impact, which occurs solely on the basis of observed scores, and the assessment of basic skills like algebra: even if groups are truly matched on mathematical ability, it is still useful to know that one is lagging because its members haven’t been taught a skill required for the test. This advice is, thus, most relevant for the study of group differences and test bias themselves.

Observed Differences Need Not Reflect Latent Ones

Jay Lasker

May 13, 2020